Training Compute-Optimal Large Language Models


29 Mar 2022 | Jordan Hoffmann*, Sebastian Borgeaud*, Arthur Mensch*, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre*
This paper investigates the optimal model size and number of training tokens for training a transformer language model under a given compute budget. The authors find that current large language models are significantly undertrained, a consequence of the recent focus on scaling model size while keeping the amount of training data roughly constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they find that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also be doubled.

They test this hypothesis by training a predicted compute-optimal model, Chinchilla, which uses the same compute budget as Gopher but has 70B parameters and is trained on 4× more data. Chinchilla uniformly and significantly outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a large range of downstream evaluation tasks. Because it is a smaller model, Chinchilla also requires substantially less compute for fine-tuning and inference, greatly facilitating downstream use. Notably, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a greater than 7% improvement over Gopher.
The paper also discusses the scaling behaviour of language models and their transfer properties, and how the optimal model size and number of training tokens should grow with the compute budget. The authors propose three different approaches to estimating the optimal allocation of parameters and training tokens; all three suggest that model size and the amount of training data should be increased in approximately equal proportions as compute increases. The results show that Chinchilla outperforms Gopher and other large models on a variety of tasks, including language modelling, reading comprehension, common-sense reasoning, and closed-book question answering, and it also performs better on evaluations related to gender bias and toxicity. The paper concludes that the current generation of large language models is considerably over-sized given their respective compute budgets, and that smaller models trained on more tokens would have been more performant.
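To make the headline scaling result concrete, here is a minimal sketch (not code from the paper) that splits a FLOP budget between parameters and training tokens. It assumes the standard training-cost approximation C ≈ 6·N·D for a dense transformer, and a tokens-per-parameter ratio of roughly 20, which is what Chinchilla's reported configuration (70B parameters, 1.4T tokens) implies; the function name compute_optimal_split and the default ratio are our own illustrative choices.

```python
import math

# A minimal sketch of the "scale parameters and tokens equally" result.
# Assumptions (not taken verbatim from the summary above):
#   * the standard training-cost approximation C ~= 6 * N * D FLOPs,
#     where N = parameters and D = training tokens;
#   * a tokens-per-parameter ratio of ~20, implied by Chinchilla's reported
#     configuration (70B parameters, 1.4T tokens, i.e. 4x Gopher's data).

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, tokens) for a compute-optimal model at this budget."""
    # Solve C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * ratio)).
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Gopher/Chinchilla-scale budget: 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
    params, tokens = compute_optimal_split(5.9e23)
    print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
    # -> roughly 70B parameters and 1.4T tokens, matching Chinchilla.
    # Note the square-root dependence on compute: a 10x larger budget buys
    # only ~3.2x more parameters and ~3.2x more tokens.
```

Under these assumptions, both the optimal parameter count and the optimal token count grow like the square root of the compute budget, which is the "scale both in equal proportion" behaviour all three of the paper's estimation approaches point to.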