9 Mar 2024 | Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla
This paper investigates the rate of algorithmic progress in pre-training language models since the advent of deep learning. Using a dataset of over 200 language model evaluations on WikiText and Penn Treebank spanning 2012-2023, the authors find that the compute required to reach a fixed performance threshold has halved approximately every 8 months (95% confidence interval: roughly 5 to 14 months), substantially faster than hardware gains under Moore's Law. They estimate augmented scaling laws, which let them quantify algorithmic progress and disentangle the contribution of scaling models from that of innovations in training algorithms (a toy sketch of such a law appears below).

Within this algorithmic progress, the introduction of the transformer architecture in 2017 stands out as a major advance: it represents between 3x and 46x in compute-equivalent gain, which accounts for more than 10% of the algorithmic innovation in pre-trained language models over the past decade (a quick consistency check of this figure also appears below).

Despite the rapid pace of algorithmic progress, the analysis reveals that compute scaling, which expanded by over a million-fold in the same period, made an even larger contribution to overall performance improvements. Overall, the work provides a quantitative estimate of the rapid pace of progress in language modeling and shows that scale, rather than algorithms, has been the dominant source of recent gains.
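To make the idea of an augmented scaling law concrete, here is a minimal sketch in the spirit of the paper's approach: a Chinchilla-style loss curve in which algorithmic progress inflates the "effective" parameter and data counts exponentially over time. The functional form is a simplification and every constant below is an illustrative placeholder, not one of the paper's fitted estimates.

```python
import numpy as np

def augmented_loss(N, D, year, E=1.7, A=400.0, B=410.0,
                   alpha=0.34, beta=0.28, g_N=0.8, g_D=0.8, t0=2012):
    """Chinchilla-style scaling law augmented with time-dependent
    efficiency gains, in the spirit of the paper's model.

    N, D : parameter count and training tokens
    year : publication year; later years get larger 'effective'
           parameter and data counts at the same physical budget.
    All constants are illustrative placeholders, not fitted values.
    """
    N_eff = N * np.exp(g_N * (year - t0))  # effective parameters
    D_eff = D * np.exp(g_D * (year - t0))  # effective data
    return E + A / N_eff**alpha + B / D_eff**beta

# A fixed (N, D) budget reaches lower loss in later years purely
# through the modeled algorithmic progress:
for year in (2012, 2017, 2023):
    print(year, round(augmented_loss(1e9, 2e10, year), 3))
```

Fitting the rate parameters (here `g_N` and `g_D`) to dated model evaluations is what lets one convert algorithmic progress into an equivalent amount of extra compute.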
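As a rough consistency check, the "more than 10%" figure for the transformer follows from the numbers quoted above: an 8-month halving time implies about 2^15 (roughly 33,000x) of total algorithmic efficiency gain over a decade, and a 3x to 46x compute-equivalent gain is about 11% to 37% of that total in log-compute terms. The back-of-envelope calculation below uses only these quoted figures; the paper's own accounting may differ in detail.

```python
import math

# Back-of-envelope check of the 'more than 10%' claim, using only
# the figures quoted above.
halving_months = 8
decade_months = 120
total_gain = 2 ** (decade_months / halving_months)  # ~32,768x over a decade

for transformer_gain in (3, 46):
    share = math.log(transformer_gain) / math.log(total_gain)
    print(f"{transformer_gain}x gain -> {share:.0%} of a decade's algorithmic progress")
# Prints roughly 11% and 37%, consistent with 'more than 10%'.
```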