ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA

9 Feb 2017 | Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang
This paper investigates the generalization gap in large-batch training for deep learning. It is observed that large-batch methods tend to converge to sharp minimizers of the training and testing functions, which are associated with poorer generalization. In contrast, small-batch methods converge to flat minimizers, which are known to generalize better. The paper presents numerical evidence supporting this view and discusses strategies to mitigate the generalization gap in large-batch training.

The study shows that large-batch methods, while faster, often result in models with lower generalization performance. This is attributed to the tendency of large-batch methods to converge to sharp minima, which are characterized by a significant number of large positive eigenvalues of the Hessian. These sharp minima are more sensitive to small perturbations and thus lead to worse generalization on new data. Small-batch methods, on the other hand, benefit from the noise introduced by smaller batches, which helps in escaping sharp minima and converging to flatter, more generalizable minima.

Numerical experiments on various deep learning architectures confirm that large-batch methods produce sharper minima, leading to a generalization gap. The paper also explores strategies to improve the generalization of large-batch methods, including data augmentation, conservative training, and robust optimization. While these strategies help reduce the generalization gap, they do not completely eliminate the issue of sharp minima. The paper concludes that large-batch methods face challenges in achieving good generalization due to their tendency to converge to sharp minima.

Future research is needed to understand the underlying reasons for this phenomenon and to develop more effective training strategies that can improve the generalization of large-batch methods.
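To make the notion of sharpness more concrete, below is a minimal sketch of a perturbation-based sharpness proxy in PyTorch: it measures how much the training loss rises when the parameters of a trained model are nudged by small random perturbations, with sharper minima expected to show larger increases. The function name, the epsilon and n_trials defaults, and the random-direction sampling are illustrative assumptions; the paper's own metric instead maximizes the loss over a small box around the minimizer rather than sampling random directions.

```python
import copy

import torch


def sharpness_proxy(model, loss_fn, data_loader, epsilon=1e-3, n_trials=10, device="cpu"):
    """Estimate sharpness as the largest relative loss increase observed when the
    parameters are perturbed by random noise of scale roughly epsilon * (|w| + 1).

    Note: this samples random directions; the paper's metric maximizes the loss
    over a small neighborhood of the minimizer, so treat this only as a cheap proxy.
    """
    model = model.to(device).eval()

    def mean_loss(m):
        total, count = 0.0, 0
        with torch.no_grad():
            for inputs, targets in data_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                total += loss_fn(m(inputs), targets).item() * inputs.size(0)
                count += inputs.size(0)
        return total / count

    base_loss = mean_loss(model)
    worst_increase = 0.0
    for _ in range(n_trials):
        perturbed = copy.deepcopy(model)
        with torch.no_grad():
            for p in perturbed.parameters():
                # Coordinate-wise noise scaled like an epsilon * (|w_i| + 1) box.
                p.add_(epsilon * (p.abs() + 1.0) * torch.randn_like(p))
        # Relative loss rise, mirroring a (f(x + y) - f(x)) / (1 + f(x)) ratio.
        increase = (mean_loss(perturbed) - base_loss) / (1.0 + base_loss)
        worst_increase = max(worst_increase, increase)
    return 100.0 * worst_increase
```

Under a proxy of this kind, the paper's findings would lead one to expect noticeably larger values for minimizers reached with large batches than for those reached with small batches, which is the empirical core of the sharp-minima explanation of the generalization gap.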