22 May 2018 | Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht
This paper investigates the generalization properties of adaptive gradient methods in machine learning. The authors show that adaptive methods like AdaGrad, RMSProp, and Adam often find solutions that generalize worse than non-adaptive methods like SGD or SGD with momentum. In a simple binary classification problem with linearly separable data, SGD achieves zero test error, while adaptive methods achieve test errors arbitrarily close to 0.5. The authors also study the generalization performance of adaptive methods on several deep learning models and find that they often generalize worse than SGD, even when they have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods for training neural networks.
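To make concrete what separates the two families, here is a minimal numpy sketch (an illustration of the update rules only, not the paper's notation or analysis, with momentum and bias corrections omitted): SGD scales every coordinate of the gradient by one global step size, whereas an AdaGrad-style update divides each coordinate by a running root-sum-of-squares of its past gradients; RMSProp and Adam use exponentially weighted variants of the same idea.

```python
import numpy as np

def sgd_step(w, grad, lr):
    # Plain SGD: one global step size for every coordinate.
    return w - lr * grad

def adagrad_step(w, grad, lr, accum, eps=1e-8):
    # AdaGrad-style update: per-coordinate scaling by accumulated squared gradients.
    accum = accum + grad ** 2
    return w - lr * grad / (np.sqrt(accum) + eps), accum

# Toy usage: minimize the quadratic 0.5 * ||w - target||^2 with both updates.
target = np.array([1.0, -2.0, 0.5])
w_sgd = np.zeros(3)
w_ada, accum = np.zeros(3), np.zeros(3)
for _ in range(200):
    w_sgd = sgd_step(w_sgd, w_sgd - target, lr=0.1)                   # gradient is w - target
    w_ada, accum = adagrad_step(w_ada, w_ada - target, lr=0.1, accum=accum)
print("SGD:", w_sgd, "AdaGrad-style:", w_ada)
```

The per-coordinate rescaling is what lets adaptive methods make rapid early progress, and it is also what can give rarely active, spurious features outsized influence, as the construction described next illustrates.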
The paper also shows that adaptive methods can overfit by giving undue influence to spurious features that have no bearing on out-of-sample generalization. The authors construct a simple generative model for which adaptive methods converge to a solution that misclassifies new data with probability arbitrarily close to one half, while SGD finds a solution with zero error on new data. In other words, adaptive methods do not always find the solutions that generalize best.
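The numpy sketch below illustrates that phenomenon under a simplified construction in the same spirit (the paper's exact generative model differs in its details): each training example gets one informative feature equal to its label, two constant features, and a few "identifier" features that appear only in that example. For least-squares problems of this kind, SGD initialized at zero converges to the minimum-norm interpolating solution, while the paper argues that the adaptive methods converge to a solution proportional to sign(X^T y); the sketch compares how those two solutions behave on fresh data whose identifier features were never seen during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 0.75                          # training examples, P[y = +1]; p > 1/2 as in the paper
y = np.where(rng.random(n) < p, 1.0, -1.0)

d = 3 + 3 * n                             # 1 informative + 2 constant + 3 identifier slots per example
X = np.zeros((n, d))
X[:, 0] = y                               # informative feature equals the label
X[:, 1:3] = 1.0                           # constant features
for i in range(n):
    k = 1 if y[i] > 0 else 3              # positives activate 1 identifier slot, negatives 3
    X[i, 3 + 3 * i : 3 + 3 * i + k] = 1.0

w_min_norm = np.linalg.pinv(X) @ y        # minimum-norm interpolant (limit of SGD here)
w_sign = np.sign(X.T @ y)                 # sign-based solution associated with the adaptive methods

# Fresh test points: same informative/constant features, but their identifier
# features occupy coordinates that were never active during training.
y_test = np.where(rng.random(20_000) < p, 1.0, -1.0)
X_test = np.zeros((len(y_test), d))
X_test[:, 0] = y_test
X_test[:, 1:3] = 1.0

for name, w in [("min-norm (SGD)", w_min_norm), ("sign-based (adaptive)", w_sign)]:
    err = np.mean(np.sign(X_test @ w) != y_test)
    print(f"{name:>22s} test error: {err:.3f}")
```

In this sketch the sign-based solution predicts the majority class on every fresh example, so its test error approaches one half as p approaches 1/2, whereas the minimum-norm solution keeps zero test error as long as both classes appear in the training set.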
Complementing this analysis, the paper presents numerical experiments showing that adaptive methods also generalize worse than their non-adaptive counterparts in practice. Across every model and task evaluated, SGD and SGD with momentum outperform the adaptive methods on the development/test set. Adaptive methods often make faster initial progress on the training set, but their development/test performance plateaus quickly. The authors also propose a simple scheme for tuning learning rates and decay schedules that performs well on all of the deep learning tasks they study.
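That style of tuning protocol is easy to emulate. Below is a hedged sketch (the toy task, grid values, and decay settings are illustrative assumptions, not the paper's experimental configuration): sweep a logarithmically spaced grid of initial learning rates, optionally apply a step decay, and keep whichever configuration achieves the best development-set accuracy.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 20
X = torch.randn(600, d)
y = (X[:, :5].sum(dim=1) > 0).float()      # toy labels from the first five features
X_train, y_train = X[:400], y[:400]
X_dev, y_dev = X[400:], y[400:]

def run(lr, step_size, gamma, epochs=100):
    # Train a linear classifier with full-batch SGD under one schedule and
    # report its development-set accuracy.
    model = nn.Linear(d, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=step_size, gamma=gamma)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_train).squeeze(1), y_train).backward()
        opt.step()
        sched.step()
    with torch.no_grad():
        preds = (model(X_dev).squeeze(1) > 0).float()
    return (preds == y_dev).float().mean().item()

# Grid: log-spaced initial learning rates x {no decay, divide by 10 every 30 epochs}.
grid = [(lr, s, g) for lr in (1e-3, 1e-2, 1e-1, 1.0) for (s, g) in ((100, 1.0), (30, 0.1))]
best = max(grid, key=lambda cfg: run(*cfg))
print("Best (initial lr, decay step, decay factor) by dev accuracy:", best)
```

The same loop can be repeated with torch.optim.Adam or torch.optim.RMSprop to mimic the style of comparison reported in the paper.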
The paper concludes that despite the popularity of adaptive methods, they may not always be the best choice for training neural networks. The authors suggest that practitioners should reconsider the use of adaptive methods and instead use non-adaptive methods like SGD or SGD with momentum for better generalization.