24 Feb 2018 | Samuel L. Smith*, Pieter-Jan Kindermans*, Chris Ying & Quoc V. Le
This paper argues that increasing the batch size during training can produce the same learning curve as decaying the learning rate, while requiring fewer parameter updates, training faster, and reaching equivalent test accuracy. The authors show that this equivalence holds for plain SGD, SGD with momentum, Nesterov momentum, and Adam: by increasing the batch size while keeping the learning rate constant, they match the results of a traditional learning-rate decay schedule with significantly fewer parameter updates.

They also show that the number of parameter updates can be reduced further by increasing the learning rate and scaling the batch size proportionally to it, and again by increasing the momentum coefficient m and scaling the batch size proportionally to 1/(1 - m), although the momentum trick slightly reduces test accuracy.

The authors demonstrate the approach on ResNet-50 and Inception-ResNet-V2, reaching high ImageNet validation accuracy with few parameter updates and short wall-clock training time. They further show that it extends to large-batch training without additional hyper-parameter tuning, simply by converting existing learning-rate schedules into batch-size schedules. The paper concludes that increasing the batch size during training is a viable alternative to learning-rate decay, one that can significantly reduce training time while maintaining model performance.
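To make the schedule conversion concrete, here is a minimal Python sketch of turning a step learning-rate decay schedule into a batch-size schedule. It assumes the approximate noise-scale relation discussed in the paper, g ~ lr * N / (B * (1 - m)) for learning rate lr, dataset size N, batch size B, and momentum m, so that each learning-rate drop can instead be applied as a proportional batch-size increase. The function names, schedule values, and the batch-size cap below are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): convert a step learning-rate
# decay schedule into a batch-size schedule that reproduces, at each stage,
# the noise scale g ~ lr * N / (B * (1 - m)) of the original schedule.

def noise_scale(lr, batch_size, dataset_size, momentum=0.9):
    """Approximate SGD noise scale (assumed form from the paper's analysis)."""
    return lr * dataset_size / (batch_size * (1.0 - momentum))

def to_batch_size_schedule(base_lr, base_batch, lr_schedule, max_batch=None):
    """Hold the learning rate fixed and grow the batch size instead.

    lr_schedule: list of (epoch, lr) pairs from the original decay schedule.
    Returns (epoch, lr, batch_size) triples; once the batch size would exceed
    max_batch (e.g. a memory limit), cap it and decay the learning rate by
    the remaining factor so the noise scale is unchanged.
    """
    converted = []
    for epoch, lr in lr_schedule:
        batch = int(round(base_batch * base_lr / lr))
        if max_batch is not None and batch > max_batch:
            lr_out = base_lr * max_batch / batch  # fall back to lr decay
            converted.append((epoch, lr_out, max_batch))
        else:
            converted.append((epoch, base_lr, batch))
    return converted

if __name__ == "__main__":
    N = 1_281_167  # ImageNet-sized training set (illustrative)
    # Hypothetical schedule: decay the learning rate 5x at epochs 30, 60, 80.
    original = [(0, 0.1), (30, 0.02), (60, 0.004), (80, 0.0008)]
    for epoch, lr, batch in to_batch_size_schedule(0.1, 128, original, max_batch=8192):
        g = noise_scale(lr, batch, N)
        print(f"epoch {epoch:>2}: lr={lr:.4f}, batch_size={batch:>5}, noise_scale~{g:.0f}")
```

The printed noise scale decreases across stages exactly as it would under the original learning-rate decay, which is the sense in which the two schedules are equivalent; the max_batch fallback reflects that in practice the batch size cannot grow indefinitely, after which the learning rate is decayed in the usual way.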