LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES

3 Jan 2020 | Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
This paper introduces LAMB, a layerwise adaptive large-batch optimization algorithm that significantly improves the training efficiency of deep neural networks, particularly for large models like BERT. The key idea is to adapt the learning rate layer by layer, which allows much larger batch sizes without degrading performance. LAMB is shown to outperform existing methods such as LARS and Adam, especially in large-batch settings. The authors demonstrate that LAMB can train BERT in just 76 minutes using a batch size of 32,868, compared with the roughly 3 days typically required with smaller batch sizes. The improvement comes from normalizing the update layer-wise and scaling each layer's learning rate by the norm of its parameters. The algorithm also comes with convergence guarantees in nonconvex settings.

Beyond BERT, the paper evaluates LAMB on other tasks, including training ResNet-50 on ImageNet. The results show that LAMB achieves state-of-the-art accuracy on these tasks, outperforming optimizers such as AdamW and LARS. The algorithm is also effective in both small- and large-batch regimes, making it a versatile tool for deep learning. The paper highlights the importance of large-batch training for accelerating model training: with LAMB, training time can be reduced dramatically without sacrificing performance, which has important implications for the practical deployment of large-scale deep learning models.
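To make the layerwise idea concrete, here is a minimal NumPy sketch of a LAMB-style update as described in the paper: Adam-style first and second moments are computed per layer, and the resulting step (plus decoupled weight decay) is rescaled by a per-layer trust ratio, the norm of the layer's parameters divided by the norm of its update. Function and variable names are illustrative, not taken from the authors' code, and details such as the clipping function on the parameter norm are omitted.

```python
import numpy as np

def lamb_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update over a dict of per-layer parameter arrays.

    Sketch only: Adam-style moments per layer, then a trust ratio that
    rescales each layer's step by ||w|| / ||update||.
    """
    new_params = {}
    for name, w in params.items():
        g = grads[name]
        # Adam-style moment estimates with bias correction.
        m[name] = beta1 * m[name] + (1 - beta1) * g
        v[name] = beta2 * v[name] + (1 - beta2) * g * g
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v[name] / (1 - beta2 ** t)
        # Adam direction plus decoupled weight decay.
        update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
        # Layerwise trust ratio: scale the step by the ratio of the
        # parameter norm to the update norm for this layer.
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(update)
        trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
        new_params[name] = w - lr * trust_ratio * update
    return new_params, m, v
```

The contrast with LARS is that LARS applies the same trust-ratio scaling to an SGD-with-momentum direction, whereas LAMB applies it to the Adam direction, which the paper argues is what makes it work well on attention-based models like BERT as well as on ResNet-50.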