29 May 2024 | Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi
This paper investigates alternative learning rate schedules for training large language models (LLMs), challenging the default choice of the cosine schedule. The authors argue that the cosine schedule is inflexible and inefficient for research: its length must be fixed in advance and it reaches its best loss only at the end of the cycle, so evaluating scaling behavior at different training lengths requires training separate models from scratch, which is both computationally expensive and restrictive. Instead, they study a simpler alternative, a constant learning rate followed by a cooldown phase, which performs comparably to the cosine schedule. They also explore stochastic weight averaging (SWA) and a schedule-free optimizer, which can improve performance without additional training cost.
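To make the schedule concrete, the sketch below expresses it as a simple step-to-learning-rate function; the warmup length, 20% cooldown fraction, and linear decay shape are illustrative assumptions rather than the paper's exact settings.

```python
def lr_at_step(step, total_steps, peak_lr, warmup_steps=1000, cooldown_frac=0.2):
    """Constant learning rate with a short linear warmup and a final linear cooldown.

    A minimal sketch: the warmup length, cooldown fraction, and linear decay
    shape are illustrative assumptions, not the paper's exact configuration.
    """
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup_steps:
        # linear warmup to the peak learning rate
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        # constant phase: training can be extended without committing to total_steps
        return peak_lr
    # cooldown phase: linear decay to zero over the final fraction of training
    return peak_lr * max(0, total_steps - step) / (total_steps - cooldown_start)
```

Because the constant phase is identical regardless of the eventual training length, a cooldown can be appended at any later point of an ongoing run, which is what enables the cheaper scaling experiments discussed below.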
The study demonstrates that a constant learning rate with a cooldown phase matches the performance of the cosine schedule while allowing the training length to be decided late, since the cooldown can be launched from any point of an ongoing run. The cooldown phase, during which the learning rate is decayed (e.g., linearly) toward zero, produces the sharp final drop in loss that the cosine schedule delivers only at the end of its cycle. Additionally, SWA improves generalization and yields better intermediate models along the run at no extra training cost, although it does not fully match the performance of a dedicated cooldown.
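As a hedged sketch of how weight averaging can be maintained alongside a constant-learning-rate run, the snippet below keeps a running average of parameters using PyTorch's stock AveragedModel helper; the toy linear model, dummy objective, and 50-step averaging interval are illustrative assumptions.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel

# Toy stand-ins for the language model and its training objective.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # constant learning rate
swa_model = AveragedModel(model)  # holds the running average of the weights

for step in range(1, 1001):
    x = torch.randn(32, 16)
    loss = model(x).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        # fold the current weights into the running average; the averaged model
        # can be evaluated at any time without pausing training for a cooldown
        swa_model.update_parameters(model)
```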
The authors also show that the compute and GPU hours needed for scaling experiments can be reduced substantially, because a single constant-learning-rate run can be reused: short cooldowns branched from intermediate checkpoints stand in for separate full-length training runs. This makes it practical to recompute scaling laws more frequently, for example when comparing data mixtures or architectures, and makes scaling research more accessible and efficient. The findings suggest that the traditional reliance on the cosine schedule is unnecessary, and that simpler alternatives achieve similar or better results with reduced computational cost.
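A minimal runnable sketch of this branching scheme is given below, with a toy linear model standing in for an LLM; the checkpoint spacing, 20% cooldown length, and the omission of optimizer-state restoration when branching are illustrative assumptions.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 1)  # toy stand-in for the language model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
target_lengths = [200, 400, 800]  # training lengths to evaluate for the scaling curve
checkpoints = {}

def train_step(m, o, lr):
    # one optimization step on a dummy objective at the given learning rate
    for group in o.param_groups:
        group["lr"] = lr
    loss = m(torch.randn(32, 16)).pow(2).mean()
    o.zero_grad()
    loss.backward()
    o.step()

# One shared main run at a constant learning rate, saving intermediate checkpoints.
for step in range(1, max(target_lengths) + 1):
    train_step(model, opt, lr=1e-3)
    if step in target_lengths:
        checkpoints[step] = copy.deepcopy(model.state_dict())

# Branch a short cooldown (linear decay to zero) from each checkpoint,
# instead of retraining a model from scratch for every target length.
losses = {}
for length, ckpt in checkpoints.items():
    branch = nn.Linear(16, 1)
    branch.load_state_dict(ckpt)
    # in a real setup the optimizer state would also be restored from the checkpoint
    branch_opt = torch.optim.AdamW(branch.parameters(), lr=1e-3)
    cooldown_steps = max(1, int(0.2 * length))
    for i in range(cooldown_steps):
        train_step(branch, branch_opt, lr=1e-3 * (1 - i / cooldown_steps))
    with torch.no_grad():
        losses[length] = branch(torch.randn(256, 16)).pow(2).mean().item()

# `losses` now gives one (training length, loss) point per branch from a single main run.
```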
The paper concludes that a constant learning rate with a cooldown phase, complemented by SWA and schedule-free optimizers, provides a more efficient and flexible approach to training LLMs. These methods enable more frequent and cost-effective scaling experiments, which is crucial for advancing large language model research.