29 Mar 2022 | Jordan Hoffmann*, Sebastian Borgeaud*, Arthur Mensch*, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre*
The paper investigates the optimal model size and number of training tokens for transformer language models under a given computational budget. The authors find that current large language models are significantly under-trained: they have been scaled up in parameter count while the amount of training data has been held roughly constant. By training over 400 language models with varying sizes and token counts, they conclude that for compute-optimal training, model size and the number of training tokens should be scaled equally; for every doubling of model size, the number of training tokens should also double. To test this hypothesis, they train a smaller but compute-optimal model called *Chinchilla*, which uses the same computational budget as *Gopher* but with 70 billion parameters and 4 times more data. *Chinchilla* significantly outperforms *Gopher* and other large models on a wide range of downstream tasks, achieving a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, more than a 7% improvement over *Gopher*. The paper also highlights *Chinchilla*'s reduced inference cost and the resulting benefits for downstream use. The authors discuss the implications of their findings for the future development of large language models, emphasizing the importance of dataset scaling and ethical considerations.
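A minimal sketch of what the "equal scaling" rule implies in practice. It assumes the standard approximation of training compute as C ≈ 6·N·D (N parameters, D tokens) and the paper's finding that the compute-optimal N and D each grow roughly as C^0.5, calibrated here on Chinchilla's reported configuration (70B parameters, about 1.4 trillion tokens); the function name and reference values are illustrative, not from this summary.

```python
# Back-of-the-envelope estimate of a compute-optimal (params, tokens) split.
# Assumptions: training FLOPs C ≈ 6 * N * D, and both N_opt and D_opt scale
# as C^0.5 (the "double the model, double the data" rule), anchored on
# Chinchilla's reported 70B parameters and ~1.4T training tokens.

def compute_optimal(flops_budget: float,
                    ref_params: float = 70e9,
                    ref_tokens: float = 1.4e12) -> tuple[float, float]:
    """Return (params, tokens) for a FLOP budget under equal 0.5/0.5 scaling."""
    ref_flops = 6.0 * ref_params * ref_tokens   # C ≈ 6 N D at the reference point
    scale = (flops_budget / ref_flops) ** 0.5   # N and D each grow as C^0.5
    return ref_params * scale, ref_tokens * scale

if __name__ == "__main__":
    for c in (1e21, 1e22, 1e23, 1e24):
        n, d = compute_optimal(c)
        print(f"C = {c:.0e} FLOPs -> ~{n / 1e9:6.1f}B params, ~{d / 1e12:5.2f}T tokens")
```

Under these assumptions, a 10× larger compute budget buys roughly a 3.2× larger model trained on 3.2× more tokens, rather than a 10× larger model on the same data, which is the paper's core departure from earlier scaling practice.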