2024-03-30 | Zhengxiao Du, Aohan Zeng, Yuxiao Dong, Jie Tang
This paper investigates the emergence of abilities in language models (LMs) from the perspective of pre-training loss rather than model size or training compute. The authors argue that pre-training loss is a better predictor of downstream task performance than model size or training compute, and demonstrate that models with the same pre-training loss but different model and data sizes achieve similar performance across a variety of downstream tasks. They also find that a model exhibits emergent abilities on certain tasks when its pre-training loss falls below a specific threshold, regardless of whether the metrics used to measure these abilities are continuous or discontinuous. This suggests that emergent abilities are not exclusive to large models; they can be observed in smaller models once their pre-training loss is low enough. The authors therefore redefine emergent abilities as those that manifest only in models with lower pre-training losses and that cannot be predicted by extrapolating the performance trends of models with higher pre-training losses. Concretely, performance on these tasks remains at the level of random guessing, even while performance on other tasks continues to improve, and only rises above chance once the pre-training loss drops below the threshold; moreover, the loss thresholds observed for these tasks coincide.
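The threshold behavior can be pictured as a piecewise relationship between pre-training loss and task performance. Below is a minimal illustrative sketch of that relationship; the threshold value, the random-guess baseline, the linear ramp, and the task_accuracy helper are all hypothetical choices made for the plot, not quantities reported or fitted in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative sketch only: the threshold, baseline, and ramp slope are
# hypothetical numbers chosen for the plot, not values from the paper.
RANDOM_GUESS = 0.25      # chance accuracy for a 4-way multiple-choice task
LOSS_THRESHOLD = 2.2     # hypothetical pre-training-loss threshold

def task_accuracy(pretraining_loss: float) -> float:
    """Piecewise accuracy-vs-loss curve for an 'emergent' task (toy model)."""
    if pretraining_loss >= LOSS_THRESHOLD:
        # Above the threshold the model performs no better than random guessing.
        return RANDOM_GUESS
    # Below the threshold, accuracy improves as the loss keeps decreasing.
    return min(1.0, RANDOM_GUESS + 0.5 * (LOSS_THRESHOLD - pretraining_loss))

losses = np.linspace(3.0, 1.5, 50)   # pre-training loss decreasing over training
accuracies = [task_accuracy(loss) for loss in losses]

plt.plot(losses, accuracies)
plt.gca().invert_xaxis()             # lower loss (later in training) to the right
plt.xlabel("Pre-training loss")
plt.ylabel("Task accuracy")
plt.title("Emergence below a loss threshold (illustrative)")
plt.show()
```

The flat segment corresponds to the paper's observation that emergent-task performance stays at chance while other tasks keep improving; the curve only bends upward once the loss crosses the threshold.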
The study concludes that pre-training loss is a better metric than model size or training compute for representing the scaling effect of language models, and that the new definition of emergent abilities offers a precise characterization of the critical junctures within training trajectories where these abilities manifest.
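As a rough formalization of that redefinition (a paraphrase in notation introduced here, not the paper's own statement): let $\ell$ be a model's pre-training loss, $P(\ell)$ its performance on a given task, and $P_r$ the random-guess baseline. The ability is emergent if there exists a threshold $\ell_0$ such that

```latex
% Paraphrased formalization; \ell, \ell_0, P, P_r are notation introduced here.
P(\ell) \approx P_r \quad \text{for } \ell \ge \ell_0,
\qquad\text{while}\qquad
P(\ell) > P_r \ \text{and improves as } \ell \text{ decreases, for } \ell < \ell_0 .
```

In words: the ability is absent (at chance) in models whose pre-training loss sits above the threshold, so its appearance cannot be extrapolated from those models' performance trends.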