LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES

3 Jan 2020 | Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
This paper introduces LAMB, a layerwise adaptive large-batch optimization algorithm that significantly improves the training efficiency of deep neural networks, particularly for large models like BERT. The key idea is to adapt the learning rate layer by layer, which allows much larger batch sizes without degrading performance. LAMB is shown to outperform existing methods such as LARS and Adam, especially in large-batch settings. The authors demonstrate that LAMB can train BERT in just 76 minutes using a batch size of 32,868, compared with the roughly 3 days typically required with smaller batch sizes. The improvement comes from normalizing the update layer-wise and scaling each layer's learning rate by the norm of its parameters. The algorithm also comes with convergence guarantees in nonconvex settings.

Beyond BERT, the paper evaluates LAMB on other tasks, including training ResNet-50 on ImageNet. The results show that LAMB achieves state-of-the-art accuracy on these tasks, outperforming optimizers such as AdamW and LARS. The algorithm is also effective in both small- and large-batch regimes, making it a versatile tool for deep learning. The paper highlights the importance of large-batch training for accelerating model training: with LAMB, training time can be reduced dramatically without sacrificing performance, which has important implications for the practical deployment of large-scale deep learning models.
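To make the layerwise idea concrete, here is a minimal NumPy sketch of a LAMB-style update as described in the paper: Adam-style first and second moments are computed per layer, and the resulting step (plus decoupled weight decay) is rescaled by a per-layer trust ratio, the norm of the layer's parameters divided by the norm of its update. Function and variable names are illustrative, not taken from the authors' code, and details such as the clipping function on the parameter norm are omitted.

```python
import numpy as np

def lamb_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update over a dict of per-layer parameter arrays.

    Sketch only: Adam-style moments per layer, then a trust ratio that
    rescales each layer's step by ||w|| / ||update||.
    """
    new_params = {}
    for name, w in params.items():
        g = grads[name]
        # Adam-style moment estimates with bias correction.
        m[name] = beta1 * m[name] + (1 - beta1) * g
        v[name] = beta2 * v[name] + (1 - beta2) * g * g
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v[name] / (1 - beta2 ** t)
        # Adam direction plus decoupled weight decay.
        update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
        # Layerwise trust ratio: scale the step by the ratio of the
        # parameter norm to the update norm for this layer.
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(update)
        trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
        new_params[name] = w - lr * trust_ratio * update
    return new_params, m, v
```

The contrast with LARS is that LARS applies the same trust-ratio scaling to an SGD-with-momentum direction, whereas LAMB applies it to the Adam direction, which the paper argues is what makes it work well on attention-based models like BERT as well as on ResNet-50.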