9 Mar 2024 | Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla
This paper investigates the rate of algorithmic progress in pre-training language models since the advent of deep learning. Using a dataset of over 200 language model evaluations on WikiText and Penn Treebank spanning 2012-2023, the authors find that the compute required to reach a fixed performance threshold has halved approximately every 8 months (95% confidence interval: roughly 5 to 14 months), substantially faster than hardware gains under Moore's Law. They estimate augmented scaling laws, which let them quantify algorithmic progress and disentangle the contribution of scaling models from that of innovations in training algorithms (a toy sketch of such a law appears below).

Within this algorithmic progress, the introduction of the transformer architecture in 2017 stands out as a major advance: it represents between 3x and 46x in compute-equivalent gain, which accounts for more than 10% of the algorithmic innovation in pre-trained language models over the past decade (a quick consistency check of this figure also appears below).

Despite the rapid pace of algorithmic progress, the analysis reveals that compute scaling, which expanded by over a million-fold in the same period, made an even larger contribution to overall performance improvements. Overall, the work provides a quantitative estimate of the rapid pace of progress in language modeling and shows that scale, rather than algorithms, has been the dominant source of recent gains.
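To make the idea of an augmented scaling law concrete, here is a minimal sketch in the spirit of the paper's approach: a Chinchilla-style loss curve in which algorithmic progress inflates the "effective" parameter and data counts exponentially over time. The functional form is a simplification and every constant below is an illustrative placeholder, not one of the paper's fitted estimates.

```python
import numpy as np

def augmented_loss(N, D, year, E=1.7, A=400.0, B=410.0,
                   alpha=0.34, beta=0.28, g_N=0.8, g_D=0.8, t0=2012):
    """Chinchilla-style scaling law augmented with time-dependent
    efficiency gains, in the spirit of the paper's model.

    N, D : parameter count and training tokens
    year : publication year; later years get larger 'effective'
           parameter and data counts at the same physical budget.
    All constants are illustrative placeholders, not fitted values.
    """
    N_eff = N * np.exp(g_N * (year - t0))  # effective parameters
    D_eff = D * np.exp(g_D * (year - t0))  # effective data
    return E + A / N_eff**alpha + B / D_eff**beta

# A fixed (N, D) budget reaches lower loss in later years purely
# through the modeled algorithmic progress:
for year in (2012, 2017, 2023):
    print(year, round(augmented_loss(1e9, 2e10, year), 3))
```

Fitting the rate parameters (here `g_N` and `g_D`) to dated model evaluations is what lets one convert algorithmic progress into an equivalent amount of extra compute.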
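As a rough consistency check, the "more than 10%" figure for the transformer follows from the numbers quoted above: an 8-month halving time implies about 2^15 (roughly 33,000x) of total algorithmic efficiency gain over a decade, and a 3x to 46x compute-equivalent gain is about 11% to 37% of that total in log-compute terms. The back-of-envelope calculation below uses only these quoted figures; the paper's own accounting may differ in detail.

```python
import math

# Back-of-envelope check of the 'more than 10%' claim, using only
# the figures quoted above.
halving_months = 8
decade_months = 120
total_gain = 2 ** (decade_months / halving_months)  # ~32,768x over a decade

for transformer_gain in (3, 46):
    share = math.log(transformer_gain) / math.log(total_gain)
    print(f"{transformer_gain}x gain -> {share:.0%} of a decade's algorithmic progress")
# Prints roughly 11% and 37%, consistent with 'more than 10%'.
```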