7 Aug 2024 | Aaron Defazio, Xingyu (Alice) Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky
The paper introduces a learning rate schedule-free optimization method that achieves state-of-the-art performance across a wide range of problems, from convex optimization to large-scale deep learning. The method eliminates the need to specify a learning rate schedule, instead using a novel approach that combines momentum with iterate averaging. It requires no hyperparameters beyond those of standard momentum-based optimizers, and it retains the worst-case convergence rate of Polyak-Ruppert averaging while often outperforming schedule-based methods.
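Paraphrasing the Schedule-Free SGD update from the paper (the notation and indexing below are my reading and may differ slightly from the paper's), gradients are taken at an interpolation y_t between the averaged iterate x_t and the base SGD iterate z_t, and x_t is a uniform running average of the z sequence:

\[
\begin{aligned}
y_t     &= (1-\beta)\, z_t + \beta\, x_t, \\
z_{t+1} &= z_t - \gamma\, \nabla f(y_t, \zeta_t), \\
x_{t+1} &= (1 - c_{t+1})\, x_t + c_{t+1}\, z_{t+1}, \qquad c_{t+1} = \tfrac{1}{t+1}.
\end{aligned}
\]

Setting β = 0 recovers classical Polyak-Ruppert averaging (gradients at z_t, evaluation at the average x_t), while β = 1 recovers primal averaging (gradients taken at the average itself); intermediate values such as β ≈ 0.9 behave much like conventional momentum in practice.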
The method is based on a new theory that unifies scheduling and iterate averaging, leading to an online-to-batch conversion theorem from which the method's convergence guarantees follow. The approach uses an alternative form of momentum that is worst-case optimal for convex Lipschitz functions, offering better performance and stability than traditional momentum methods.
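As a concrete illustration, here is a minimal NumPy sketch of such a loop on a toy noisy quadratic. This is not the authors' implementation: the function names, constants, and the toy objective are illustrative, and only the three-sequence structure shown above is taken from the paper.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.5, beta=0.9, steps=2000):
    """Toy sketch of a Schedule-Free SGD loop (illustrative, not the reference code).

    grad: callable returning a (possibly stochastic) gradient at a point.
    Returns the averaged iterate x, which is the point used for evaluation.
    """
    x = np.array(x0, dtype=float)   # averaged iterate (evaluation point)
    z = x.copy()                    # base SGD iterate
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # gradient is taken at this interpolation
        z = z - lr * grad(y)              # constant-step SGD update; no schedule
        c = 1.0 / (t + 1)                 # uniform-averaging weight
        x = (1.0 - c) * x + c * z         # running average of the z sequence
    return x

# Minimize f(w) = 0.5 * ||w||^2 from noisy gradients; the average ends up near 0.
rng = np.random.default_rng(0)
w = schedule_free_sgd(lambda v: v + 0.1 * rng.standard_normal(v.shape),
                      x0=np.ones(5))
print(np.round(w, 3))
```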
The method is evaluated on a broad set of problems, including deep learning tasks such as image classification, natural language processing, and recommendation systems, as well as convex optimization problems. Across these benchmarks, Schedule-Free methods match or outperform heavily tuned cosine schedules.
The paper also discusses practical implications of the method, including the ability to use larger learning rates without divergence, and highlights the role of the momentum parameter in ensuring convergence. The method is effective in both convex and non-convex settings, demonstrating its versatility in practice. The results indicate that Schedule-Free learning is a viable alternative to traditional scheduling, offering a simple and effective approach to optimization without an explicit schedule.
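The authors released an open-source implementation; the snippet below is a rough usage sketch assuming the interface of that `schedulefree` PyPI package (the class name `AdamWScheduleFree` and the `optimizer.train()` / `optimizer.eval()` mode switches are my reading of the released code and should be checked against the current release). The key practical difference from a scheduled optimizer is that no scheduler object is created, and the optimizer must be told whether the model is training or being evaluated, since evaluation should use the averaged weights.

```python
# Usage sketch (assumed API of the released `schedulefree` package; verify before use).
import torch
import schedulefree

model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()

# Single constant learning rate; no LR scheduler is constructed anywhere.
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

for epoch in range(3):
    optimizer.train()            # assumed: switches weights to the training point
    for _ in range(100):
        inputs = torch.randn(32, 10)
        targets = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    optimizer.eval()             # assumed: switches weights to the averaged point
    with torch.no_grad():
        val = criterion(model(torch.randn(32, 10)), torch.randint(0, 2, (32,)))
    print(f"epoch {epoch}: validation loss {val.item():.4f}")
```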