9 Feb 2017 | Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang
The paper investigates the generalization gap observed in large-batch training for deep learning models, which often leads to poorer performance compared to small-batch methods. The authors explore why large-batch methods tend to converge to sharp minimizers of the training function, characterized by a significant number of large positive eigenvalues in the Hessian matrix, while small-batch methods converge to flat minimizers with numerous small eigenvalues. Numerical experiments support the hypothesis that sharp minimizers generalize poorly due to their sensitivity to perturbations. The paper also discusses strategies to mitigate the generalization gap, such as data augmentation, conservative training, and robust optimization, but notes that these approaches do not fully resolve the issue. The authors conclude by posing several open questions, including whether large-batch methods can be steered away from sharp minimizers and how to design neural network architectures that are more suitable for large-batch training.
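The sharp-versus-flat contrast above is stated in terms of the eigenvalues of the training-loss Hessian at the solution. As a rough illustration only (not code from the paper), the sketch below estimates the largest Hessian eigenvalue with power iteration on Hessian-vector products in PyTorch; `model`, `loss_fn`, `inputs`, and `targets` are placeholder names for a trained network, its loss, and a batch of training data.

```python
# Minimal sketch, assuming a trained PyTorch model and a batch of training data.
# It probes sharpness by estimating the largest eigenvalue of the loss Hessian
# via power iteration, using autograd to form Hessian-vector products.
import torch


def hessian_vector_product(loss, params, vec):
    """Return H @ vec, where H is the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_vec = torch.dot(flat_grad, vec)
    hv = torch.autograd.grad(grad_dot_vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])


def top_hessian_eigenvalue(model, loss_fn, inputs, targets, iters=20):
    """Power iteration: repeatedly apply H to a unit vector and re-normalize."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    device = params[0].device
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=device)
    v /= v.norm()
    eigenvalue = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(loss, params, v)
        eigenvalue = torch.dot(v, hv).item()  # Rayleigh quotient with unit v
        v = hv / (hv.norm() + 1e-12)
    return eigenvalue
```

Under the paper's hypothesis, a solution found with large batches would show a noticeably larger top eigenvalue (and a larger loss increase under small parameter perturbations) than a small-batch solution on the same data, which is the kind of comparison the authors' sharpness experiments are designed to capture.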