This April 2018 paper introduces Adafactor, an optimization method designed to reduce the memory footprint of adaptive learning-rate algorithms such as Adam, particularly for large neural networks. Adafactor estimates per-parameter second moments from a factored representation: for an n × m weight matrix it maintains only the row and column sums of the moving averages of squared gradients, which reduces the auxiliary memory requirement from O(nm) to O(n+m). The paper also addresses training instability in adaptive methods, proposing two remedies: update clipping and a gradually increasing decay rate for the second-moment accumulator. In addition, the paper proposes scaling parameter updates relative to the magnitude of the parameters themselves rather than using absolute step sizes. Experiments with the Transformer model on machine translation show that Adafactor achieves performance comparable to Adam while requiring significantly less auxiliary memory.
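The factored estimate and update clipping described above can be illustrated with a minimal pure-Python sketch. It is not the paper's reference implementation: the function names are hypothetical, matrices are plain nested lists, and the constants (the 0.8 decay exponent, the 1e-30 regularizer, the clipping threshold of 1.0) are assumed here as illustrative defaults.

```python
import math


def beta2_at(step, exponent=0.8):
    """Decay rate that increases toward 1 with the step count (assumed schedule)."""
    return 1.0 - step ** (-exponent)


def update_factored_stats(row_acc, col_acc, grad, beta2, eps=1e-30):
    """Update moving averages of the row and column sums of squared gradients.

    Only n + m numbers are stored for an n x m gradient, instead of n * m.
    """
    squared = [[g * g + eps for g in row] for row in grad]
    new_row = [beta2 * r + (1.0 - beta2) * sum(s)
               for r, s in zip(row_acc, squared)]
    new_col = [beta2 * c + (1.0 - beta2) * sum(s)
               for c, s in zip(col_acc, zip(*squared))]
    return new_row, new_col


def reconstruct_second_moment(row_acc, col_acc):
    """Rank-1 estimate of the full second-moment matrix: row * col / total."""
    total = sum(row_acc)
    return [[r * c / total for c in col_acc] for r in row_acc]


def rms(matrix):
    """Root mean square over all entries of a matrix."""
    flat = [x for row in matrix for x in row]
    return math.sqrt(sum(x * x for x in flat) / len(flat))


def clip_update(update, threshold=1.0):
    """Scale the update down so that its RMS does not exceed the threshold."""
    scale = max(1.0, rms(update) / threshold)
    return [[u / scale for u in row] for row in update]


def adafactor_like_step(params, grad, row_acc, col_acc, step, rel_lr=1e-2):
    """One illustrative step: factored second moment, clipping, relative step size."""
    row_acc, col_acc = update_factored_stats(row_acc, col_acc, grad, beta2_at(step))
    v_hat = reconstruct_second_moment(row_acc, col_acc)
    update = [[g / math.sqrt(v) for g, v in zip(g_row, v_row)]
              for g_row, v_row in zip(grad, v_hat)]
    update = clip_update(update)
    # Step size is relative to the parameters' own RMS, with a small floor (assumed).
    lr = max(1e-3, rms(params)) * rel_lr
    new_params = [[p - lr * u for p, u in zip(p_row, u_row)]
                  for p_row, u_row in zip(params, update)]
    return new_params, row_acc, col_acc
```

When the matrix of squared gradients is itself rank 1, the row and column sums reconstruct it exactly. In the general case the product is only an approximation, which is the price paid for storing n + m accumulator values rather than nm.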
Source:
https://arxiv.org/pdf/1804.04235