AdaFactor is a stochastic optimization method that reduces memory usage while retaining the empirical benefits of adaptivity. The paper proposes a factored representation of the squared-gradient accumulator: for matrix-valued variables it tracks only the row and column sums of the moving average of squared gradients, reducing memory requirements from \(O(nm)\) to \(O(n + m)\) for an \(n \times m\) matrix. On a large-scale machine translation task, the method achieves performance comparable to Adam while using significantly less auxiliary storage.
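To make the factoring concrete, here is a minimal sketch of one accumulator update for a single matrix-valued parameter, assuming NumPy arrays; the function name and constants are illustrative, and details such as bias correction and the learning-rate schedule are omitted.

```python
import numpy as np

def factored_second_moment_update(R, C, grad, beta2=0.999, eps=1e-30):
    """One factored second-moment update for an n x m gradient (illustrative sketch).

    Instead of storing a full n x m moving average of squared gradients, only
    its row sums R (length n) and column sums C (length m) are kept.
    """
    sq = grad ** 2 + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)   # row sums, shape (n,)
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)   # column sums, shape (m,)

    # Rank-1 reconstruction of the full second-moment estimate:
    # V_hat = outer(R, C) / sum(R), which matches R and C on its margins.
    V_hat = np.outer(R, C) / R.sum()
    update = grad / np.sqrt(V_hat)
    return R, C, update

# Usage: only the two vectors R and C persist between steps.
n, m = 4, 3
R, C = np.zeros(n), np.zeros(m)
R, C, update = factored_second_moment_update(R, C, np.random.randn(n, m))
```

The paper motivates this particular rank-1 reconstruction by showing it minimizes a generalized Kullback-Leibler divergence to the full accumulator, which is why keeping only the row and column sums suffices.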
The paper also addresses issues with Adam, such as instability caused by outdated second-moment estimators and larger-than-desired updates. AdaFactor introduces update clipping and a gradually increasing decay rate scheme to mitigate these issues. Additionally, it proposes scaling parameter updates relative to the scale of the parameters themselves, which is shown to be more resilient to different initialization and scaling schemes.
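The sketch below illustrates these fixes in isolation, again assuming NumPy arrays; the helper names are hypothetical, and the default constants (clipping threshold 1.0, relative step size \(10^{-2}\), parameter-scale floor \(10^{-3}\), decay exponent 0.8) follow the values suggested in the paper.

```python
import numpy as np

def rms(x):
    """Root-mean-square of a tensor, used for both clipping and scaling."""
    return np.sqrt(np.mean(x ** 2))

def apply_update(param, unscaled_update, clip_threshold=1.0, rho=1e-2, eps2=1e-3):
    # Update clipping: if the RMS of the update exceeds the threshold,
    # rescale the whole update so its RMS equals the threshold.
    d = max(1.0, rms(unscaled_update) / clip_threshold)
    clipped = unscaled_update / d

    # Relative step size: the step is proportional to the RMS of the parameter
    # itself, with a floor eps2 so near-zero (freshly initialized) parameters
    # still receive meaningful updates.
    alpha = max(eps2, rms(param)) * rho
    return param - alpha * clipped

def beta2_schedule(t, c=0.8):
    # Gradually increasing decay rate (t starts at 1): the second-moment
    # estimate adapts quickly early in training and averages more stably later.
    return 1.0 - t ** (-c)
```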
Experiments with the Transformer model on the WMT 2014 English-German translation task show that AdaFactor achieves BLEU scores similar to Adam with momentum while using much less memory. The method is also shown to be more stable without momentum and to perform well with relative step sizes.