Resolving Discrepancies in Compute-Optimal Scaling of Language Models


25 Jul 2024 | Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
This paper addresses the discrepancy between the scaling laws proposed by Kaplan et al. [26] and Hoffmann et al. [22] for optimal model size as a function of compute budget. The authors reproduce the Kaplan et al. scaling law on two datasets (OpenWebText2 and RefinedWeb) and identify three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. By correcting these factors, they achieve excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Contrary to Hoffmann et al.'s hypothesis, they find that careful learning rate decay is not essential for the validity of the Hoffmann et al. scaling law. Additionally, they derive scaling laws for the optimal learning rate and batch size, concluding that tuning the AdamW β2 parameter is crucial at lower batch sizes. The paper includes detailed experimental results, hyperparameter sweeps, and analyses of the optimal loss and computational costs, providing insights into the optimal resource allocation for language model training.
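As a rough illustration of how a compute-optimal scaling exponent of this kind can be estimated, the sketch below fits a power law of the form N_opt = k · C^a by linear regression in log-log space. The compute budgets and model sizes in the snippet are invented for illustration only and are not the paper's data; the fitting recipe is a generic one, not necessarily the exact procedure used by the authors.

```python
import numpy as np

# Hypothetical illustration (not the paper's data): fitting a compute-optimal
# scaling law N_opt = k * C^a by linear regression in log-log space.
# C: compute budgets (FLOPs); N_opt: model size minimizing loss at each budget.
compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])   # FLOPs (made up)
n_opt = np.array([4e7, 1.3e8, 4e8, 1.3e9, 4e9])      # parameters (made up)

# Fit log N_opt = a * log C + log k; the slope a is the scaling exponent.
a, log_k = np.polyfit(np.log(compute), np.log(n_opt), deg=1)
print(f"exponent a ~ {a:.2f}, coefficient k ~ {np.exp(log_k):.3g}")
```

For reference, Hoffmann et al. report an exponent close to 0.5 (model size and training tokens should grow roughly in proportion with compute), whereas Kaplan et al. report a substantially larger exponent that favors bigger models trained on fewer tokens; reconciling these two fits is the central question the paper addresses.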