30 Apr 2018 | Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
This paper presents a method to train deep neural networks with large minibatches, specifically up to 8192 images, using distributed synchronous Stochastic Gradient Descent (SGD). The authors address the challenges of optimization with large minibatches, particularly on the ImageNet dataset, and demonstrate that these issues can be mitigated without sacrificing accuracy. They propose a hyperparameter-free linear scaling rule for adjusting learning rates based on minibatch size and a new warmup scheme to overcome early training optimization difficulties. Using this approach, they achieve a significant reduction in training time, training ResNet-50 with 256 GPUs in one hour while maintaining the same accuracy as training with smaller minibatches.
The techniques are shown to generalize to more complex tasks like object detection and instance segmentation, and the authors provide detailed guidelines and implementation details to ensure correct and efficient distributed SGD. The findings enable efficient training of visual recognition models on large-scale datasets, with potential applications in industrial and research domains.
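The two core ideas, the linear scaling rule (multiply the base learning rate by k when the minibatch size is multiplied by k) and gradual warmup (ramp the learning rate up linearly over the first few epochs), can be sketched as a simple schedule. This is an illustrative sketch, not the authors' code; the function name and signature are made up, while the defaults (base rate 0.1 for a 256-image minibatch, 5 warmup epochs) follow the paper's ResNet-50 setup:

```python
def scaled_lr(epoch, iteration, iters_per_epoch, batch_size,
              base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Learning rate under the linear scaling rule with gradual warmup.

    Sketch only: real training would also apply the usual step decay
    (e.g. dividing by 10 at epochs 30, 60, 80), omitted here for clarity.
    """
    k = batch_size / base_batch
    target_lr = base_lr * k  # linear scaling rule
    if epoch < warmup_epochs:
        # gradual warmup: ramp linearly from base_lr to target_lr
        total_warmup_iters = warmup_epochs * iters_per_epoch
        step = epoch * iters_per_epoch + iteration
        return base_lr + (target_lr - base_lr) * step / total_warmup_iters
    return target_lr
```

For a minibatch of 8192 (k = 32), the schedule starts at 0.1 and ramps to 3.2 by the end of epoch 5, after which the scaled rate applies.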