Adafactor is an adaptive learning rate method that reduces memory usage while retaining the empirical benefits of adaptivity. It does so by storing a factored representation of the squared-gradient accumulator, cutting the memory requirement from O(nm) to O(n + m) for an n × m parameter matrix: only the row sums and column sums of the exponentially smoothed squared gradients are kept, and a rank-1 approximation of the full accumulator is reconstructed from them when needed. This factored estimate is shown to produce results comparable to full accumulators in large-scale machine translation tasks.
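To make the factorization concrete, here is a minimal NumPy sketch; the function name, the beta2 default, and the small epsilon are illustrative assumptions, not the reference Tensor2Tensor implementation.

```python
import numpy as np

def factored_second_moment_update(R, C, grad, beta2=0.999, eps=1e-30):
    """One step of a factored second-moment estimate for a 2-D parameter.

    Only the row sums R (length n) and column sums C (length m) of the
    exponential moving average of grad**2 are stored, instead of the full
    n x m accumulator.  Names and defaults here are illustrative.
    """
    sq = grad ** 2 + eps                            # small eps keeps the estimate strictly positive
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)    # O(n) row statistics
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)    # O(m) column statistics
    V_hat = np.outer(R, C) / R.sum()                # rank-1 reconstruction of the accumulator
    return R, C, V_hat

# Usage: precondition the gradient with the reconstructed second moment.
rng = np.random.default_rng(0)
R, C = np.zeros(4), np.zeros(3)
grad = rng.normal(size=(4, 3))
R, C, V_hat = factored_second_moment_update(R, C, grad)
update = grad / np.sqrt(V_hat)   # Adam-style scaling, without storing the n x m accumulator
```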
Adafactor also addresses a source of instability in the Adam optimizer: with a slowly decaying second-moment estimator, the accumulator can become out of date relative to the current gradients, producing parameter updates that are larger than desired. To mitigate this, the paper proposes update clipping, which scales down any update whose magnitude exceeds a threshold, together with a decay rate for the second-moment estimator that gradually increases over the course of training. By combining these techniques and dropping momentum, Adafactor matches the published Adam results when training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage.
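A minimal sketch of these two fixes follows, assuming an RMS-based clipping threshold of 1.0 and a decay exponent of 0.8; both values are stated here as plausible defaults rather than taken from the summary above.

```python
import numpy as np

def clip_update(update, threshold=1.0):
    """Scale the whole update down if its root-mean-square exceeds `threshold`.
    Sketch of update clipping; the name and default threshold are assumptions."""
    rms = np.sqrt(np.mean(update ** 2))
    return update / max(1.0, rms / threshold)

def second_moment_decay(step, exponent=0.8):
    """Decay rate that increases toward 1 as training proceeds,
    beta2_t = 1 - t**(-exponent); 0.8 is one plausible exponent."""
    return 1.0 - float(step) ** (-exponent)

# Usage: an out-of-date second-moment estimate can no longer blow up the step size.
raw_update = np.array([5.0, -3.0, 0.5])
print(clip_update(raw_update))                                   # rescaled so RMS <= 1.0
print(second_moment_decay(step=1), second_moment_decay(step=10_000))
```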
The paper also introduces relative step sizes: instead of an absolute learning rate, each update is scaled by the scale (root-mean-square) of the parameter being updated, which makes the algorithm more resilient to different parameter initialization and scaling schemes. Experiments show that Adafactor converges stably, with results comparable to Adam with momentum. Proposed hyperparameters are given, and the algorithm is specified for both weight matrices and vectors. The code for Adafactor is available in the open-source Tensor2Tensor library.
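As a rough illustration of the relative step-size idea, the sketch below scales a step-size schedule by the root-mean-square of the parameter it updates; the particular schedule min(1e-2, 1/sqrt(t)) and the floor of 1e-3 are assumptions chosen for concreteness, not values quoted from the summary.

```python
import numpy as np

def relative_step_size(param, step, eps2=1e-3):
    """Absolute step size = relative schedule * the parameter's own scale.

    The schedule min(1e-2, 1/sqrt(t)) and the floor eps2 (which keeps
    near-zero initializations moving) are illustrative assumptions.
    """
    rho = min(1e-2, 1.0 / np.sqrt(step))          # relative step-size schedule
    param_rms = np.sqrt(np.mean(param ** 2))      # scale of the parameter matrix
    return max(eps2, param_rms) * rho

# Usage: the same relative schedule yields different absolute steps
# for a small-magnitude matrix and a large-magnitude one.
small = np.full((4, 3), 0.01)
large = np.full((4, 3), 1.0)
print(relative_step_size(small, step=100), relative_step_size(large, step=100))
```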