The paper investigates the predictability of large language model (LLM) performance across five orders of magnitude of training compute, using eleven recent model families. The study focuses on two widely used benchmarks: BIG-Bench and MMLU. The authors find that aggregate benchmark performance is reasonably predictable from scaling laws, with an average absolute error of 6 percentage points (pp) when extrapolating across one order of magnitude of compute. Individual tasks within these benchmarks are considerably less predictable, however, with an average error of 18 pp. The study suggests that compute scaling provides a promising basis for forecasting AI capabilities on diverse benchmarks, though predicting performance on specific tasks remains challenging. The paper also discusses the limitations and implications of these findings, highlighting the need for more challenging benchmarks and better modeling of task relationships to improve predictability.
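To make the kind of extrapolation described above concrete, the sketch below (not the paper's code; the data points, sigmoid parameterization, and holdout split are all illustrative placeholders) fits a saturating curve of benchmark accuracy against log-compute on smaller models and measures the error, in percentage points, when extrapolating one order of magnitude beyond the fitted range.

```python
# Illustrative sketch (hypothetical data): fit a sigmoidal scaling curve of
# benchmark accuracy vs. log10(training compute), hold out the largest-compute
# point, and report the extrapolation error in percentage points.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, lower, upper, midpoint, slope):
    """Accuracy as a saturating function of log10(compute in FLOP)."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (log_c - midpoint)))

# Made-up (log10 compute, accuracy) observations spanning several orders of magnitude.
log_compute = np.array([19.0, 20.0, 21.0, 22.0, 23.0, 24.0])
accuracy    = np.array([0.27, 0.30, 0.38, 0.52, 0.66, 0.74])

# Fit on all but the last order of magnitude, then extrapolate to it.
train = log_compute < 23.5
params, _ = curve_fit(
    sigmoid, log_compute[train], accuracy[train],
    p0=[0.25, 0.95, 22.0, 1.0], maxfev=10_000,
)

pred = sigmoid(log_compute[~train], *params)
err_pp = 100 * np.abs(pred - accuracy[~train]).mean()
print(f"held-out extrapolation error: {err_pp:.1f} pp")
```

Under this setup, the aggregate-benchmark result in the paper corresponds to an error of roughly 6 pp on the held-out compute range, while fitting the same form to individual tasks yields noisier curves and errors closer to 18 pp.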