3 Jul 2024 | Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
The paper introduces Adam-mini, a memory-efficient variant of AdamW that reduces optimizer memory usage by 45% to 50% while maintaining or improving performance. The key idea is to cut down the learning-rate resources in Adam: parameters are partitioned into blocks following the Hessian structure, and each block is assigned a single learning rate (i.e., a single second-moment value) rather than one per parameter. This design exploits the near-block-diagonal structure of the Hessian in Transformers, where each block has its own eigenvalue distribution. The authors find that, for each such block, a single high-quality learning rate can outperform Adam, provided sufficient resources are available to search for it. Adam-mini is evaluated on a range of language-model workloads, including pre-training, supervised fine-tuning, and RLHF, where it matches or exceeds AdamW while delivering higher throughput, especially when hardware resources are limited. The reduced memory footprint also lowers communication overhead among GPUs, further improving throughput. The paper discusses the design principles, experimental results, and potential future improvements, highlighting the effectiveness and efficiency of Adam-mini in training large language models.
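A minimal sketch of the per-block second-moment idea, under stated assumptions: the block partitioning is taken as given (a list of parameter tensors chosen to follow the near-block-diagonal Hessian structure), and the function name `adam_mini_step` and the `state` layout are illustrative, not the authors' implementation:

```python
import torch

def adam_mini_step(blocks, grads, state, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=0.1):
    """One update step. `state` keeps a per-parameter momentum `m` and a
    single scalar second moment `v` per block, instead of one `v` entry per
    parameter as in Adam/AdamW."""
    beta1, beta2 = betas
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(blocks, grads)):
        if i not in state:
            state[i] = {"m": torch.zeros_like(p), "v": 0.0}
        st = state[i]
        # First moment: same per-parameter EMA as in Adam.
        st["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # Second moment: one scalar per block, an EMA of mean(g^2) over the block.
        st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean().item()
        m_hat = st["m"] / (1 - beta1 ** t)
        v_hat = st["v"] / (1 - beta2 ** t)
        # Decoupled weight decay (AdamW style), then the shared-learning-rate update.
        p.mul_(1 - lr * weight_decay)
        p.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))

# Toy usage with two hypothetical parameter blocks:
blocks = [torch.randn(4), torch.randn(8)]
grads = [torch.randn_like(b) for b in blocks]
state = {}
adam_mini_step(blocks, grads, state)
```

Because `v` shrinks from one value per parameter to one scalar per block, the optimizer state is roughly halved relative to AdamW, which is the source of the reported 45% to 50% memory saving.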