24 Feb 2018 | Samuel L. Smith*, Pieter-Jan Kindermans*, Chris Ying & Quoc V. Le
This paper argues that increasing the batch size during training can produce the same learning curve as decaying the learning rate, while requiring fewer parameter updates, training faster, and reaching equivalent test accuracy. The authors show that this equivalence holds for plain SGD, SGD with momentum, Nesterov momentum, and Adam: by increasing the batch size while keeping the learning rate constant, they match the results of a traditional learning-rate decay schedule with significantly fewer parameter updates.

They also show that the number of parameter updates can be reduced further by increasing the learning rate and scaling the batch size proportionally to it, and again by increasing the momentum coefficient m and scaling the batch size proportionally to 1/(1 - m), although the momentum trick slightly reduces test accuracy.

The authors demonstrate the approach on ResNet-50 and Inception-ResNet-V2, reaching high ImageNet validation accuracy with few parameter updates and short wall-clock training time. They further show that it extends to large-batch training without additional hyper-parameter tuning, simply by converting existing learning-rate schedules into batch-size schedules. The paper concludes that increasing the batch size during training is a viable alternative to learning-rate decay, one that can significantly reduce training time while maintaining model performance.
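To make the schedule conversion concrete, here is a minimal Python sketch of turning a step learning-rate decay schedule into a batch-size schedule. It assumes the approximate noise-scale relation discussed in the paper, g ~ lr * N / (B * (1 - m)) for learning rate lr, dataset size N, batch size B, and momentum m, so that each learning-rate drop can instead be applied as a proportional batch-size increase. The function names, schedule values, and the batch-size cap below are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): convert a step learning-rate
# decay schedule into a batch-size schedule that reproduces, at each stage,
# the noise scale g ~ lr * N / (B * (1 - m)) of the original schedule.

def noise_scale(lr, batch_size, dataset_size, momentum=0.9):
    """Approximate SGD noise scale (assumed form from the paper's analysis)."""
    return lr * dataset_size / (batch_size * (1.0 - momentum))

def to_batch_size_schedule(base_lr, base_batch, lr_schedule, max_batch=None):
    """Hold the learning rate fixed and grow the batch size instead.

    lr_schedule: list of (epoch, lr) pairs from the original decay schedule.
    Returns (epoch, lr, batch_size) triples; once the batch size would exceed
    max_batch (e.g. a memory limit), cap it and decay the learning rate by
    the remaining factor so the noise scale is unchanged.
    """
    converted = []
    for epoch, lr in lr_schedule:
        batch = int(round(base_batch * base_lr / lr))
        if max_batch is not None and batch > max_batch:
            lr_out = base_lr * max_batch / batch  # fall back to lr decay
            converted.append((epoch, lr_out, max_batch))
        else:
            converted.append((epoch, base_lr, batch))
    return converted

if __name__ == "__main__":
    N = 1_281_167  # ImageNet-sized training set (illustrative)
    # Hypothetical schedule: decay the learning rate 5x at epochs 30, 60, 80.
    original = [(0, 0.1), (30, 0.02), (60, 0.004), (80, 0.0008)]
    for epoch, lr, batch in to_batch_size_schedule(0.1, 128, original, max_batch=8192):
        g = noise_scale(lr, batch, N)
        print(f"epoch {epoch:>2}: lr={lr:.4f}, batch_size={batch:>5}, noise_scale~{g:.0f}")
```

The printed noise scale decreases across stages exactly as it would under the original learning-rate decay, which is the sense in which the two schedules are equivalent; the max_batch fallback reflects that in practice the batch size cannot grow indefinitely, after which the learning rate is decayed in the usual way.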