2 Jul 2024 | Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
Observational scaling laws enable the prediction of complex language model (LM) performance without training models across multiple scales. By analyzing 80 publicly available models, we show that LM performance is a function of a low-dimensional capability space, and that model families vary mainly in their efficiency at converting training compute into capabilities. This approach allows us to predict emergent capabilities, agentic performance, and the impact of post-training interventions such as Chain-of-Thought and Self-Consistency. We show that these phenomena follow smooth, sigmoidal trends and can be accurately predicted from smaller models. Observational scaling laws also enable the comparison of models from different families on a unified scale, and provide a way to evaluate and optimize LMs via a low-rank decomposition of existing benchmarks. Our results demonstrate that observational scaling is cost-effective and scalable, and that it provides high-resolution predictions of complex LM capabilities. We also show that our approach can be used to select low-cost model subsets for practical scaling analyses, and that these subsets maintain high prediction accuracy while significantly reducing evaluation costs. Finally, we discuss potential applications of observational scaling laws beyond LM scaling, including their use as evaluation metrics and optimization targets for pretraining.
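To make the recipe concrete, the sketch below illustrates the two-step pipeline the abstract describes, on synthetic data: extract a low-dimensional capability space via a low-rank decomposition (here, PCA) of a model-by-benchmark score matrix, then fit a sigmoidal link from those capability measures to a downstream metric. This is a minimal sketch under stated assumptions; the synthetic data, the PCA/logistic choices, and all variable names are illustrative, not the paper's exact implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.optimize import curve_fit

# Hypothetical data: scores of 80 models on 8 standard benchmarks
# (rows = models, cols = benchmarks), generated from one latent capability.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 80, 8
latent = rng.normal(size=(n_models, 1))  # latent "capability" per model
scores = latent @ rng.normal(size=(1, n_benchmarks)) \
    + 0.1 * rng.normal(size=(n_models, n_benchmarks))

# Step 1: low-rank decomposition of benchmark scores -> capability measures.
pca = PCA(n_components=3)
pc_measures = pca.fit_transform(scores)  # low-dimensional capability space

# Step 2: sigmoidal link from a linear combination of capability measures
# to a downstream metric (e.g., an emergent or agentic benchmark score).
def sigmoid(X, w1, w2, w3, b, scale):
    z = X[:, 0] * w1 + X[:, 1] * w2 + X[:, 2] * w3 + b
    return scale / (1.0 + np.exp(-z))

# Synthetic downstream target that depends smoothly on the latent capability.
downstream = 1.0 / (1.0 + np.exp(-1.5 * latent[:, 0]))
params, _ = curve_fit(sigmoid, pc_measures, downstream,
                      p0=[1, 0, 0, 0, 1], maxfev=10000)

# Predict performance from capability measures alone and check the fit.
pred = sigmoid(pc_measures, *params)
ss_res = np.sum((downstream - pred) ** 2)
ss_tot = np.sum((downstream - downstream.mean()) ** 2)
print("R^2:", 1 - ss_res / ss_tot)
```

In this toy setting the fit is nearly perfect by construction; the point is the structure of the method: once capability measures are cheap to compute from existing benchmark results, the sigmoidal link can be fit on smaller models and extrapolated, with no new pretraining runs required.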