6 Jun 2024 | Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
The paper explores why predicting the downstream capabilities of advanced AI models with scale remains challenging. While the scaling of pretraining performance is well established, specific downstream capabilities are far harder to predict. The authors identify a factor that makes modeling this behavior difficult: the sequence of transformations used to compute downstream metrics such as accuracy, Brier score, and probability mass on the correct choice. These transformations degrade the statistical relationship between performance and scale, making it harder to predict how performance changes with increasing compute. The key issue is that these metrics require comparing the correct choice against a small set of specific incorrect choices, which introduces variability and unpredictability. The authors empirically study how probability mass on the incorrect choices fluctuates with increasing compute, suggesting that more predictable scaling behavior might be achievable by modeling these fluctuations. This work contributes to the development of more reliable evaluations for frontier AI models, emphasizing the importance of understanding the factors that affect downstream performance.
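To make the "sequence of transformations" concrete, here is a minimal Python sketch of how common multiple-choice metrics are typically derived from a model's per-choice log-likelihoods. The function name and example values are illustrative, not taken from the paper; it only assumes that a log-probability is available for each answer choice.

```python
import numpy as np

def downstream_metrics(logprobs_per_choice, correct_idx):
    """Illustrative computation of multiple-choice metrics from a model's
    log-likelihoods over the available answer choices.

    logprobs_per_choice: log p(choice | prompt), one entry per choice.
    correct_idx: index of the correct choice.
    """
    logprobs = np.asarray(logprobs_per_choice, dtype=float)

    # Renormalize probability mass over the small set of provided choices.
    # This is one of the transformations the paper argues degrades
    # predictability, because it ties the score for the correct choice
    # to the probability mass on the specific incorrect choices.
    probs = np.exp(logprobs - logprobs.max())
    probs /= probs.sum()

    # Probability mass on the correct choice.
    p_correct = float(probs[correct_idx])

    # Accuracy: 1 if the correct choice receives the highest probability.
    accuracy = float(np.argmax(probs) == correct_idx)

    # Brier score: squared error between the predicted distribution over
    # choices and the one-hot indicator of the correct choice.
    one_hot = np.zeros_like(probs)
    one_hot[correct_idx] = 1.0
    brier = float(np.mean((probs - one_hot) ** 2))

    return {"p_correct": p_correct, "accuracy": accuracy, "brier": brier}


# Hypothetical example: four answer choices, the second one is correct.
print(downstream_metrics([-2.3, -0.7, -1.9, -3.1], correct_idx=1))
```

Note how accuracy and the Brier score depend not only on the model's confidence in the correct answer but also on how mass is spread across the particular incorrect choices, which is the source of unpredictability the paper highlights.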