3 Jul 2024 | Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
Adam-mini is a new optimizer that matches or exceeds the performance of AdamW while using 45% to 50% less memory. It saves memory by cutting down the learning-rate resources in Adam: instead of keeping a per-coordinate second-moment estimate, it assigns a single learning rate to each block of parameters, with the blocks chosen according to the structure of the Hessian matrix. With this design, Adam-mini performs on par with AdamW across a range of language-model workloads, including pre-training, supervised fine-tuning, and reinforcement learning from human feedback (RLHF).

Adam-mini also improves throughput by reducing communication overhead among GPUs, which shortens training time. This is especially pronounced for large models: on Llama2-7B it reaches 49.6% higher throughput than AdamW on 2× A800-80GB GPUs, saving 33% of the wall-clock time for pre-training. The optimizer is lightweight, effective, and efficient, using far fewer learning rates without sacrificing performance, and it is compatible with other memory-reduction techniques such as GaLore and Sophia, enabling further savings.

The key principle behind Adam-mini is the partitioning of model parameters based on the structure of the Hessian matrix, which allows a single, well-chosen learning rate per block. This works particularly well for Transformers, whose Hessian is near-block-diagonal. Overall, Adam-mini is a promising alternative to AdamW, offering substantial memory savings without compromising performance.
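To make the per-block learning-rate idea concrete, here is a minimal PyTorch-style sketch. It treats each parameter tensor as one block and keeps a single scalar second-moment estimate per block (the paper partitions attention layers more finely, e.g. per head, following the Hessian block structure). The class name AdamMiniSketch and the hyperparameter defaults are illustrative assumptions, not the authors' implementation.

```python
import torch


class AdamMiniSketch(torch.optim.Optimizer):
    """Sketch of Adam-mini's core idea: per-coordinate momentum,
    but only ONE second-moment scalar per parameter block."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)               # per-coordinate momentum
                    state["v"] = torch.zeros((), device=p.device)  # one scalar per block
                state["step"] += 1
                t = state["step"]

                # Decoupled weight decay, as in AdamW.
                if group["weight_decay"] != 0:
                    p.mul_(1 - group["lr"] * group["weight_decay"])

                # First moment: identical to Adam.
                state["m"].mul_(beta1).add_(g, alpha=1 - beta1)
                # Second moment: a single scalar per block, updated with the
                # mean of squared gradients over the whole block.
                state["v"].mul_(beta2).add_(g.pow(2).mean(), alpha=1 - beta2)

                m_hat = state["m"] / (1 - beta1 ** t)
                v_hat = state["v"] / (1 - beta2 ** t)
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])
```

In this scheme the second-moment state shrinks from one value per coordinate to one scalar per block; since that state accounts for roughly half of Adam's optimizer memory, this is consistent with the 45% to 50% reduction reported for Adam-mini.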