2024 | Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani
This paper explores the impact of removing the square root from adaptive gradient methods such as Adam and investigates how this change affects their performance and theoretical properties. The authors argue that the square root, while often motivated as an approximate second-order correction, is precisely what makes these methods fundamentally different from second-order optimizers and obscures their interpretation. Removing it, the paper shows, lets adaptive methods close the generalization gap to stochastic gradient descent (SGD) on convolutional neural networks (CNNs) while maintaining their performance on transformers. Removing the root also simplifies the implementation and reduces computational cost, making these methods better suited to low-precision training. The paper develops a second-order perspective on adaptive methods, viewing the gradient outer product as a new kind of empirical Fisher matrix, which helps disentangle the roles of adaptivity and sign descent. Building on this view, the authors propose root-free RMSProp and inverse-free Shampoo, which are invariant to scaling and affine reparameterization and perform well across a range of models, including CNNs, LSTMs, GNNs, and transformers. The findings offer new insight into the design of adaptive methods and highlight the central role of adaptivity in their success.
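To make the contrast concrete, here is a minimal sketch of a standard RMSProp step next to a root-free variant that divides by the second-moment estimate directly. This is an illustration of the idea only, not the paper's exact algorithm; the function name and hyperparameters are chosen for the example, and the paper derives its root-free update from a second-order argument with its own hyperparameterization.

```python
import numpy as np

def rmsprop_step(w, g, v, lr=1e-3, beta=0.9, eps=1e-8, root=True):
    """One RMSProp-style update, with or without the square root.

    w: parameters, g: gradient, v: running second-moment estimate.
    With root=True this is the familiar update g / (sqrt(v) + eps);
    with root=False the gradient is divided by v itself, illustrating
    root-free preconditioning. (Hypothetical sketch; the paper's
    root-free RMSProp differs in its derivation and scaling.)
    """
    v = beta * v + (1.0 - beta) * g**2              # EMA of squared gradients
    precond = np.sqrt(v) + eps if root else v + eps  # with vs. without the root
    w = w - lr * g / precond
    return w, v
```

Without the root, the step scales with the gradient magnitude instead of behaving like sign descent, which is the distinction the paper's empirical-Fisher view is meant to make precise.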