30 Mar 2024 | Zhengxiao Du1,2, Aohan Zeng1,2, Yuxiao Dong2, Jie Tang2
This paper challenges the belief that emergent abilities in language models are exclusive to large models by proposing a new perspective on understanding these abilities through the lens of pre-training loss. The authors pre-train over 30 language models of varying sizes and evaluate their performance on 12 diverse downstream tasks. They find that the pre-training loss is a strong predictor of downstream task performance, regardless of model size or data size. Specifically, they observe that certain tasks exhibit emergent abilities—i.e., performance improvements beyond random guessing—when the pre-training loss falls below a specific threshold. This threshold is consistent across different tasks, even when evaluated using continuous metrics.
Based on these findings, the authors redefine emergent abilities as abilities that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by extrapolating the performance trends of higher pre-training loss models. The paper also discusses the limitations of their approach, such as the impact of tokenizer and corpus distribution on pre-training loss, and suggests future directions for research.
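The threshold effect described above can be illustrated with a toy sketch (this is not the paper's data or code; the baseline, threshold value, and slope below are purely hypothetical): downstream accuracy sits at the random-guess baseline while pre-training loss is above some threshold, then improves as loss drops further.

```python
# Hypothetical illustration of an emergence threshold (illustrative
# numbers only; not taken from the paper).
RANDOM_BASELINE = 0.25   # e.g. random guessing on 4-way multiple choice
LOSS_THRESHOLD = 2.2     # assumed pre-training-loss threshold

def emergent_accuracy(pretrain_loss: float) -> float:
    """Toy accuracy-vs-loss curve with a threshold.

    Above the threshold, accuracy stays at the random-guess baseline,
    so extrapolating from high-loss models predicts no improvement.
    Below it, accuracy rises as pre-training loss keeps falling.
    """
    if pretrain_loss >= LOSS_THRESHOLD:
        return RANDOM_BASELINE
    return min(1.0, RANDOM_BASELINE + 0.5 * (LOSS_THRESHOLD - pretrain_loss))

# Models sampled along a decreasing-loss trajectory: flat at baseline,
# then a visible jump once loss crosses the threshold.
losses = [3.0, 2.6, 2.3, 2.1, 1.9, 1.6]
accuracies = [emergent_accuracy(l) for l in losses]
```

The point of the sketch is the extrapolation failure the authors emphasize: a curve fit only to the flat, above-threshold region would predict baseline performance forever, missing the improvement that appears once the loss crosses the threshold.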