Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

2024 | Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani
This paper investigates what happens when the square root is removed from adaptive gradient methods, viewed from a second-order perspective. Adaptive methods such as Adam are ubiquitous in deep learning, but the square root applied to their diagonal, gradient-outer-product-based preconditioner marks a fundamental difference from second-order methods. The study finds that removing the root yields square-root-free adaptive methods that close the generalization gap to SGD on convolutional architectures while maintaining performance on vision transformers, and that bring practical benefits such as faster half-precision training and improved computational efficiency.

From the second-order perspective, the square root turns out to be inessential to these methods' second-order motivation. The paper introduces the concept of preconditioner invariance, which allows arbitrary curvature approximations to be incorporated without requiring matrix root decompositions. This bridges the computational gap between diagonal and non-diagonal (matrix) methods and enables low-precision training.

Theoretically, square-root-free methods remain well grounded, particularly in convex settings, and offer scale and affine invariance. They also address issues such as ill-conditioning and numerical stability, making them suitable for modern training pipelines.

Empirically, square-root-free methods perform well across CNNs, LSTMs, GNNs, and vision transformers, and outperform root-based methods in speed and memory usage. Concretely, the paper proposes root-free variants of RMSProp and Shampoo that are invariant to loss scaling and affine reparametrization, and demonstrates their effectiveness in modern training scenarios. Overall, the findings highlight the importance of adaptivity for the success of adaptive methods and suggest that removing the square root can lead to more efficient and effective optimization strategies.
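To make the square-root distinction concrete, here is a minimal NumPy sketch of a diagonal, RMSProp-style step with and without the root. The function name `rmsprop_step`, its signature, and the hyperparameter defaults are illustrative assumptions, not the paper's implementation, which also changes how the preconditioner is accumulated and damped.

```python
import numpy as np

def rmsprop_step(theta, grad, state, lr=1e-3, beta=0.99, eps=1e-8, root_free=False):
    """One diagonally preconditioned step (illustrative sketch only).

    root_free=False: theta -= lr * grad / (sqrt(v) + eps)   # classic RMSProp/Adam-style
    root_free=True : theta -= lr * grad / (v + eps)         # square root removed
    """
    v = state.get("v", np.zeros_like(theta))
    v = beta * v + (1.0 - beta) * grad ** 2   # EMA of elementwise squared gradients
    denom = (v + eps) if root_free else (np.sqrt(v) + eps)
    state["v"] = v
    return theta - lr * grad / denom, state
```

Without the root, `v + eps` acts like a (diagonal) curvature estimate rather than its square root, so the learning rate and damping generally need retuning; a naive removal like this also loses Adam's invariance to rescaling the loss, which is why the paper's root-free variants adjust the preconditioner accumulation so that loss-scale invariance is recovered.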
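The same idea extends to non-diagonal, Kronecker-factored preconditioners. The sketch below contrasts a Shampoo-style update, which needs fractional matrix powers of both Kronecker factors, with a root-free variant that only needs linear solves, illustrating why dropping the root avoids matrix root decompositions. It is a simplified sketch under those assumptions, not the paper's actual root-free Shampoo; the helper names and hyperparameters are hypothetical.

```python
import numpy as np

def _sym_power(A, p, floor=1e-12):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.maximum(w, floor) ** p) @ Q.T

def kron_step(W, G, state, lr=1e-3, beta=0.99, damping=1e-4, root_free=False):
    """Kronecker-factored preconditioned step for a matrix parameter W (sketch).

    root_free=False: W -= lr * L^{-1/4} @ G @ R^{-1/4}   # Shampoo-style, needs matrix roots
    root_free=True : W -= lr * L^{-1}   @ G @ R^{-1}     # plain linear solves suffice
    """
    m, n = W.shape
    L = state.get("L", damping * np.eye(m))
    R = state.get("R", damping * np.eye(n))
    L = beta * L + (1.0 - beta) * G @ G.T   # left Kronecker factor statistics
    R = beta * R + (1.0 - beta) * G.T @ G   # right Kronecker factor statistics
    if root_free:
        update = np.linalg.solve(L, np.linalg.solve(R.T, G.T).T)  # L^{-1} G R^{-1}
    else:
        update = _sym_power(L, -0.25) @ G @ _sym_power(R, -0.25)
    state["L"], state["R"] = L, R
    return W - lr * update, state
```

The contrast makes the summary's computational claim visible: the root-based branch requires an eigendecomposition per Kronecker factor, while the root-free branch reduces to two linear solves, which is cheaper and more amenable to low-precision arithmetic.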