Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations


29 May 2024 | Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro von Werra, Martin Jaggi
This paper examines the complexity and inefficiency of the cosine learning rate schedule in large language model (LLM) training, particularly for scaling experiments. The authors argue that the cosine schedule requires the training duration to be fixed in advance to match the cycle length, which makes training for multiple durations impractical and costly. As an alternative, they propose a constant learning rate followed by a cooldown phase, which scales as predictably and reliably as the cosine schedule. They further demonstrate that stochastic weight averaging (SWA) improves performance along the training trajectory at no additional training cost. With these tools, scaling experiments can be performed with significantly less compute and fewer GPU hours by reusing a small number of training runs, and the paper shows that the cost of scaling law experiments can be reduced by a third or more. Overall, the work highlights the value of simpler and more flexible training schedules for LLM training, making scaling research more accessible and efficient.
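To make the two ideas concrete, below is a minimal sketch of a constant learning rate with a linear warmup and a linear cooldown, together with a running weight average in the spirit of SWA. This is not the authors' code: the names `lr_at_step` and `RunningWeightAverage`, the linear cooldown shape, and the numpy-style parameter arrays are illustrative assumptions.

```python
def lr_at_step(step, max_lr, warmup_steps, total_steps, cooldown_steps):
    """Constant learning rate after a linear warmup, decayed linearly to zero
    over the final `cooldown_steps` steps (a sketch, not the paper's exact schedule)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps      # linear warmup
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return max_lr                                  # constant plateau
    remaining = total_steps - step
    return max_lr * remaining / cooldown_steps         # linear cooldown to zero


class RunningWeightAverage:
    """Running mean of parameter snapshots (stochastic weight averaging),
    updated from periodic checkpoints at no extra training cost.
    Assumes numpy-like arrays supporting .copy() and in-place arithmetic."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, params):
        # `params`: dict mapping parameter name -> array snapshot.
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.count  # incremental mean
```

The practical point of such a schedule is that the constant plateau can be extended indefinitely and the cooldown branched off any intermediate checkpoint, so a single long run can stand in for several training durations rather than requiring a separate cosine cycle per duration.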