Unraveling the Mystery of Scaling Laws: Part I


5 Apr 2024 | Hui Su*, Zhi Tian*, Xiaoyu Shen, Xunliang Cai
This technical report investigates the validity and practical application of scaling laws in large language models. The original OpenAI paper introduced scaling laws that describe a power-law relationship between model performance and factors such as model size, dataset size, and computational resources. However, that paper did not provide sufficient details to derive the precise formulas, and its conclusions were drawn from models with up to 1.5 billion parameters. Subsequent works have attempted to extend these laws to larger models but often failed to account for important factors such as learning rate, context length, and batch size, leading to unreliable predictions. The authors confirm that the scaling-law formulas from the original OpenAI paper remain valid when scaling up to 33 billion parameters, although the constant coefficients vary with the experimental setup. By training smaller models with 1M to 60M parameters, they estimate all constant terms in the scaling-law formulas.
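For reference, the functional forms in question are those of the original OpenAI scaling-laws paper (Kaplan et al., 2020). This summary does not reproduce the report's exact equations, so the expressions below are the standard forms from that earlier work, with N_c, D_c, B_* and the exponents being the setup-dependent constants the authors estimate:

L(N) = (N_c / N)^{\alpha_N}
L(D) = (D_c / D)^{\alpha_D}
L(N, D) = \left[ (N_c / N)^{\alpha_N / \alpha_D} + D_c / D \right]^{\alpha_D}
B_\mathrm{crit}(L) = B_* / L^{1 / \alpha_B}

Here N is the (non-embedding) parameter count, D the number of training tokens, L the test loss, and B_crit(L) the batch size at which the time/computation trade-off balances.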
Using these formulas, they show that various attributes of models with up to 33 billion parameters can be predicted accurately before training, including the minimum possible test loss, the minimum number of training steps and tokens required to reach a given loss, the critical batch size for an optimal time/computation trade-off, and the complete test-loss trajectory at an arbitrary batch size. The study also shows how scaling laws can help determine the most suitable batch size, model size, dataset mix ratio, and training duration under a fixed computational budget. The work marks a shift from a theoretical understanding of scaling laws to their practical derivation and application, with the aim of advancing the development of large-scale language models. Key results include the influence of hyperparameters on convergence rate, the time/computation trade-off involved in adjusting batch size, and the impact of context length, tokenization, and data distribution on the constants in the scaling-law formulas. The report provides transparent, step-by-step instructions for estimating all constant terms and demonstrates the effectiveness of the resulting formulas in predicting model performance.
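As a rough illustration of how such constants might be fitted from small runs and then used for prediction, the sketch below fits L(N) = (N_c / N)^{\alpha_N} in log-log space to hypothetical small-model losses and extrapolates to 33B parameters. The data points, the placeholder B_* and \alpha_B values, and the variable names are illustrative assumptions, not the report's actual numbers or code.

# A minimal sketch, assuming a Kaplan-style form L(N) = (N_c / N)**alpha_N for the
# converged test loss and B_crit(L) = B_star / L**(1 / alpha_B) for the critical
# batch size. All data points and constants below are hypothetical placeholders.
import numpy as np

# Hypothetical (parameter count, converged test loss) pairs from small runs (1M-60M).
sizes = np.array([1e6, 5e6, 20e6, 60e6])
losses = np.array([5.1, 4.3, 3.7, 3.3])

# L(N) = (N_c / N)^alpha_N is a straight line in log-log space:
# log L = -alpha_N * log N + alpha_N * log N_c.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
alpha_n = -slope
n_c = np.exp(intercept / alpha_n)

# Extrapolate the minimum achievable test loss for a 33B-parameter model.
predicted_loss_33b = (n_c / 33e9) ** alpha_n
print(f"alpha_N = {alpha_n:.3f}, N_c = {n_c:.3e}, predicted L(33B) = {predicted_loss_33b:.2f}")

# Critical batch size at that loss, with placeholder values for B_* and alpha_B
# (in practice these would be fitted separately, e.g. from gradient-noise measurements).
b_star, alpha_b = 2e8, 0.21
b_crit_tokens = b_star / predicted_loss_33b ** (1.0 / alpha_b)
print(f"estimated critical batch size: {b_crit_tokens:.2e} tokens")

In the same spirit, fitting L(D), L(N, D), and the step-wise loss formulas would supply the remaining constants needed to choose batch size, model size, and training duration under a fixed compute budget.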