Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021-12-08 | Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu and Geoffrey Irving
This paper presents an analysis of the performance of Transformer-based language models across a wide range of scales, from models with tens of millions of parameters up to Gopher, a 280-billion-parameter model. The models are evaluated on 152 diverse tasks, achieving state-of-the-art performance on the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language; logical and mathematical reasoning benefit less.

The paper describes the training of Gopher, a state-of-the-art large language model, covering architecture specification, optimization, infrastructure, and the curation of a high-quality text dataset called MassiveText. It performs a broad analysis of benchmark performance across the 152 tasks, which probe several diverse aspects of intelligence, and finds that Gopher improves on current state-of-the-art language models in roughly 81% of the tasks with comparable results, notably in knowledge-intensive domains such as fact-checking and general knowledge.
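To make the cross-scale comparison concrete, a common way to summarize performance against model size is to fit a power law, loss(N) ≈ a · N^(−α), in log-log space. The following is a minimal illustrative sketch, not the paper's own analysis: the parameter counts match the Gopher model family, but the loss values are made-up placeholders.

```python
# Illustrative sketch: fit a power law, loss(N) ~ a * N^(-alpha),
# to per-model evaluation losses across a range of scales.
import numpy as np

# Parameter counts follow the Gopher model family; the loss values
# are made-up placeholders, not results from the paper.
sizes = np.array([44e6, 117e6, 417e6, 1.4e9, 7.1e9, 280e9])
losses = np.array([3.30, 3.05, 2.80, 2.55, 2.35, 2.10])

# A power law is a straight line in log-log space:
#   log(loss) = log(a) - alpha * log(N)
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted loss(N) ~ {a:.2f} * N^(-{alpha:.3f})")
```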
The paper examines model toxicity and bias, and how scale influences these properties. Larger models are more likely to generate toxic responses when given toxic prompts, but they can also classify toxicity more accurately. The paper also analyzes Gopher in a dialogue-interaction setting via prompting and presents several transcripts to demonstrate qualitative capabilities and limitations of the model.

Finally, the paper discusses the ethical and safe application of these models, including which types of undesirable behavior to mitigate before versus after training, application-driven safety, and the potential for language models to accelerate research toward safer intelligent technology. Alongside a holistic analysis of the training dataset and the model's behavior, it highlights the importance of dataset quality and the need for further research into mitigating biases and toxic content in language models.
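As a rough sketch of the dialogue-via-prompting setup described above: a raw language model can be conditioned on a persona header plus the conversation so far, then asked to continue. The `generate` function and the persona text here are hypothetical stand-ins, not the paper's actual prompt.

```python
# Minimal sketch of dialogue via prompting: build a transcript-style
# prompt and let the language model complete the next turn.
def dialogue_prompt(history: list[tuple[str, str]], user_turn: str) -> str:
    # Hypothetical persona header, not the prompt used in the paper.
    header = ("The following is a conversation between a curious User "
              "and Gopher, a knowledgeable AI assistant.\n\n")
    turns = "".join(f"{speaker}: {text}\n" for speaker, text in history)
    return header + turns + f"User: {user_turn}\nGopher:"

# Usage, assuming some sampling function `generate(prompt) -> str`:
# reply = generate(dialogue_prompt([("User", "Hi"), ("Gopher", "Hello!")],
#                                  "What is MassiveText?"))
```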