Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

30 Apr 2018 | Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
This paper presents a method for training deep neural networks on the ImageNet dataset with very large minibatches, reaching high accuracy in just one hour on 256 GPUs. The key techniques are a hyperparameter-free linear scaling rule for the learning rate and a warmup strategy that overcomes early optimization difficulties. The linear scaling rule adjusts the learning rate in proportion to the minibatch size, while the warmup strategy increases the learning rate gradually so that training remains stable. Together, these techniques allow a ResNet-50 to be trained with a minibatch size of 8192 on 256 GPUs in one hour while matching the accuracy of training with small minibatches.

The system achieves approximately 90% scaling efficiency when moving from 8 to 256 GPUs, and the results show that large minibatch sizes do not degrade generalization performance. The techniques also generalize to other tasks such as object detection and instance segmentation. The paper further discusses implementation details that matter in distributed SGD, including weight decay, momentum correction, gradient aggregation, and data shuffling, and it describes an efficient communication infrastructure built on standard Ethernet networking. The findings demonstrate that large minibatch training is feasible and efficient, enabling accurate models to be trained on internet-scale data.
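To make the learning-rate schedule concrete, below is a minimal sketch (not the authors' Caffe2 implementation) of the linear scaling rule combined with gradual warmup, assuming the paper's reference setting of a base learning rate of 0.1 for a minibatch of 256, a 5-epoch linear warmup, and the standard ImageNet schedule that divides the rate by 10 at epochs 30, 60, and 80. The function name and parameter names here are illustrative.

```python
def learning_rate(epoch, minibatch_size,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Return the learning rate for a given epoch (sketch, not the paper's code).

    Linear scaling rule: scale the reference learning rate by
    (minibatch_size / base_batch). During the first `warmup_epochs`
    epochs, ramp linearly from the small-batch rate up to the scaled
    target rate to keep early optimization stable.
    """
    target_lr = base_lr * minibatch_size / base_batch
    if epoch < warmup_epochs:
        # Gradual warmup: interpolate from base_lr to target_lr.
        alpha = epoch / warmup_epochs
        return base_lr + (target_lr - base_lr) * alpha
    # After warmup, follow the usual stepwise schedule: divide the
    # rate by 10 at epochs 30, 60, and 80.
    factor = 1.0
    for milestone in (30, 60, 80):
        if epoch >= milestone:
            factor *= 0.1
    return target_lr * factor


# Example: a minibatch of 8192 gives a target rate of 0.1 * 8192/256 = 3.2,
# reached gradually over the first 5 epochs and then decayed stepwise.
for e in (0, 2, 5, 30, 80):
    print(e, round(learning_rate(e, 8192), 4))
```

In this sketch the warmup starts from the small-minibatch rate (0.1) rather than from zero, matching the gradual warmup described in the paper; the stepwise decay afterward is the conventional 90-epoch ImageNet recipe the paper builds on.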