Training Compute-Optimal Large Language Models


29 Mar 2022 | Jordan Hoffmann*, Sebastian Borgeaud*, Arthur Mensch*, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre*
This paper investigates the optimal model size and number of training tokens for training a transformer language model under a given compute budget. The authors find that current large language models are significantly undertrained, a consequence of the recent focus on scaling model size while keeping the amount of training data roughly constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they find that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also be doubled.

They test this hypothesis by training a predicted compute-optimal model, Chinchilla, which uses the same compute budget as Gopher but has 70B parameters and is trained on 4× more data. Chinchilla uniformly and significantly outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a large range of downstream evaluation tasks. Because it is a smaller model, Chinchilla also requires substantially less compute for fine-tuning and inference, greatly facilitating downstream use. Notably, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a greater than 7% improvement over Gopher.
The paper also discusses the scaling behaviour of language models and their transfer properties, and how the optimal model size and number of training tokens should grow with the compute budget. The authors propose three different approaches to estimating the optimal allocation of parameters and training tokens; all three suggest that model size and the amount of training data should be increased in approximately equal proportions as compute increases. The results show that Chinchilla outperforms Gopher and other large models on a variety of tasks, including language modelling, reading comprehension, common-sense reasoning, and closed-book question answering, and it also performs better on evaluations related to gender bias and toxicity. The paper concludes that the current generation of large language models is considerably over-sized given their respective compute budgets, and that smaller models trained on more tokens would have been more performant.
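To make the headline scaling result concrete, here is a minimal sketch (not code from the paper) that splits a FLOP budget between parameters and training tokens. It assumes the standard training-cost approximation C ≈ 6·N·D for a dense transformer, and a tokens-per-parameter ratio of roughly 20, which is what Chinchilla's reported configuration (70B parameters, 1.4T tokens) implies; the function name compute_optimal_split and the default ratio are our own illustrative choices.

```python
import math

# A minimal sketch of the "scale parameters and tokens equally" result.
# Assumptions (not taken verbatim from the summary above):
#   * the standard training-cost approximation C ~= 6 * N * D FLOPs,
#     where N = parameters and D = training tokens;
#   * a tokens-per-parameter ratio of ~20, implied by Chinchilla's reported
#     configuration (70B parameters, 1.4T tokens, i.e. 4x Gopher's data).

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, tokens) for a compute-optimal model at this budget."""
    # Solve C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * ratio)).
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Gopher/Chinchilla-scale budget: 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
    params, tokens = compute_optimal_split(5.9e23)
    print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
    # -> roughly 70B parameters and 1.4T tokens, matching Chinchilla.
    # Note the square-root dependence on compute: a 10x larger budget buys
    # only ~3.2x more parameters and ~3.2x more tokens.
```

Under these assumptions, both the optimal parameter count and the optimal token count grow like the square root of the compute budget, which is the "scale both in equal proportion" behaviour all three of the paper's estimation approaches point to.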