This technical report by Hui Su, Zhi Tian, Xiaoyu Shen, and Xunliang Cai explores scaling laws in large language models, focusing on optimizing various aspects of model pre-training. The authors confirm that the scaling law formulations proposed by OpenAI remain valid when scaling model size up to 33 billion parameters, but the constant coefficients vary significantly with different experimental setups. They provide a detailed, step-by-step guide to estimating these constant terms using models with 1M to 60M parameters, demonstrating the ability to accurately predict various attributes of models up to 33 billion parameters before training. The report highlights the importance of hyperparameters such as batch size, learning rate, and learning rate scheduler, which influence convergence rates but not the final converged loss. The authors also illustrate how scaling laws can aid in determining optimal batch sizes, model sizes, dataset mix ratios, and training durations under fixed computational constraints. The research aims to advance the development of large-scale language models by shifting the understanding of scaling laws from theoretical concepts to practical implementation.
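To make the "estimate constants on small models, extrapolate to large ones" idea concrete, here is a minimal sketch in Python (not code from the report) of fitting the parameter-count term of an OpenAI-style scaling law, L(N) = (N_c / N)^alpha_N, to losses from small runs and extrapolating to 33B parameters. The specific model sizes, loss values, and initial guesses below are illustrative assumptions, not figures from the report.

```python
# Minimal sketch (hypothetical data): fit the constants of an OpenAI-style
# parameter-count scaling law, L(N) = (N_c / N) ** alpha_N, from small models.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, converged loss) pairs for 1M-60M parameter runs.
params = np.array([1e6, 3e6, 10e6, 30e6, 60e6])
losses = np.array([4.60, 4.10, 3.65, 3.30, 3.12])

def log_power_law(log_n, log_nc, alpha_n):
    # log L = alpha_N * (log N_c - log N); fitting in log space keeps the
    # optimization well-conditioned across several orders of magnitude.
    return alpha_n * (log_nc - log_n)

(log_nc, alpha_n), _ = curve_fit(
    log_power_law, np.log(params), np.log(losses), p0=(np.log(1e14), 0.07)
)

print(f"alpha_N ~= {alpha_n:.3f}, N_c ~= {np.exp(log_nc):.3e}")

# Extrapolate to a 33B-parameter model, the largest size validated in the report.
pred_loss = (np.exp(log_nc) / 33e9) ** alpha_n
print(f"predicted converged loss at 33B parameters: {pred_loss:.3f}")
```

In practice the same curve-fitting recipe is repeated for the other variables the report discusses (data size, compute budget, batch size), each with its own fitted constants, since those coefficients depend on the experimental setup.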