Resolving Discrepancies in Compute-Optimal Scaling of Language Models


25 Jul 2024 | Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
This paper addresses the discrepancy between two compute-optimal scaling laws for language model training: one proposed by Kaplan et al. [26] and the other by Hoffmann et al. [22]. The authors identify three factors behind the differing predictions: the computational cost of the last layer, the duration of learning rate warmup, and scale-dependent optimizer tuning. After correcting these factors, they obtain excellent agreement with the Hoffmann et al. scaling law. They further show that careful learning rate decay is not essential for its validity: a constant learning rate schedule recovers the same compute-optimal behavior. In addition, they derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW β₂ parameter is essential at lower batch sizes. The study provides insights into how to correctly predict and perform model scaling, with implications for the design and training of large language models. The results are supported by extensive experiments on two datasets, OpenWebText2 and RefinedWeb, and the authors release code and data for reproducibility.
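To make the "computational cost of the last layer" factor concrete, the following is a minimal, illustrative sketch (not code from the paper) contrasting two FLOP-accounting conventions: counting only non-embedding parameters versus also charging the final vocabulary projection (unembedding), under the standard approximation of roughly 6 FLOPs per parameter per token for a forward-plus-backward pass. The layer counts, model widths, and vocabulary size below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (not from the paper): how much of per-token training
# compute the final vocabulary projection ("last layer") accounts for at
# small vs. large model scales, using the ~6 FLOPs per parameter per token
# approximation for forward + backward.

def transformer_params(n_layers: int, d_model: int, vocab_size: int):
    """Rough parameter counts for a decoder-only transformer.

    Per block we count ~4*d^2 for the attention projections and ~8*d^2 for a
    4x-wide MLP; the unembedding matrix is d_model * vocab_size. These are
    approximations, ignoring biases, norms, and positional parameters.
    """
    per_block = 12 * d_model ** 2
    non_embedding = n_layers * per_block
    unembedding = d_model * vocab_size
    return non_embedding, unembedding


def flops_per_token(non_embedding: int, unembedding: int, include_last_layer: bool) -> float:
    """~6 FLOPs per parameter per token (forward + backward)."""
    params = non_embedding + (unembedding if include_last_layer else 0)
    return 6.0 * params


if __name__ == "__main__":
    vocab = 50_257  # GPT-2 BPE vocabulary size, used here for illustration
    # Hypothetical small and large configurations.
    configs = {"small (12L, d=768)": (12, 768), "large (48L, d=6144)": (48, 6144)}
    for name, (n_layers, d_model) in configs.items():
        non_emb, unemb = transformer_params(n_layers, d_model, vocab)
        without = flops_per_token(non_emb, unemb, include_last_layer=False)
        with_last = flops_per_token(non_emb, unemb, include_last_layer=True)
        share = 1.0 - without / with_last
        print(f"{name}: last layer is {share:.1%} of per-token compute")
```

Under these hypothetical configurations, the unembedding matmul accounts for roughly a third of per-token compute at the small scale but only about 1% at the large scale, which suggests why an accounting that omits it can skew scaling-law fits dominated by small models.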