2021-12-08 | Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu and Geoffrey Irving
This paper presents an analysis of Transformer-based language model performance across a wide range of scales, from models with tens of millions of parameters to a 280 billion parameter model called *Gopher*. The models are evaluated on 152 diverse tasks, achieving state-of-the-art performance in most cases. The gains from scale are most significant in areas such as reading comprehension, fact-checking, and toxic language identification, but less so in logical and mathematical reasoning. The authors provide a holistic analysis of the training dataset and model behavior, covering the intersection of model scale with bias and toxicity. They also discuss the application of language models to AI safety and the mitigation of downstream harms. Key findings include:
1. **Performance Improvements with Scale**: Gopher outperforms current state-of-the-art language models on 100 out of 124 tasks, with significant improvements in reading comprehension, humanities, ethics, STEM, and medicine categories.
2. **Toxicity and Bias Analysis**: Larger models are more likely to produce toxic continuations when given toxic prompts, but they are also more accurate at classifying toxicity (a few-shot classification sketch follows this list). The models also exhibit distributional biases, for example in how different genders are associated with occupations and in sentiment towards different social groups.
3. **Dialogue**: Dialogue-Prompted Gopher can emulate a conversational format with reasonable quality, and its responses do not become more toxic with model scale even when it is prompted with toxic questions (a prompting sketch also follows this list).
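Below is a minimal sketch of the few-shot classification setup behind the toxicity-classification finding: the language model is conditioned on a handful of labelled examples and the label whose tokens it assigns the higher likelihood is taken as its prediction. GPT-2 from Hugging Face `transformers` stands in for Gopher (which is not publicly available), and the prompt wording, example texts, and test comment are illustrative, not the paper's actual evaluation prompt.

```python
# Few-shot toxicity classification with a plain language model (sketch).
# GPT-2 is a stand-in for Gopher; prompt and examples are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

FEW_SHOT_PROMPT = (
    "Text: You are all wonderful people.\nToxic: no\n\n"
    "Text: I hope something terrible happens to you.\nToxic: yes\n\n"
    "Text: {comment}\nToxic:"
)

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of the model's log-probabilities for the answer tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities of each next token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(log_probs[0, pos, input_ids[0, pos + 1]].item() for pos in answer_positions)

comment = "Nobody asked for your worthless opinion."
prompt = FEW_SHOT_PROMPT.format(comment=comment)
prediction = max([" yes", " no"], key=lambda a: answer_logprob(prompt, a))
print("toxic" if prediction == " yes" else "not toxic")
```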
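And a minimal sketch of the dialogue-prompting idea from the "Dialogue" point: Dialogue-Prompted Gopher is not fine-tuned for chat; the conversational behavior comes from conditioning the base model on a hand-written prompt that establishes a persona and a turn-taking format. Again GPT-2 stands in for Gopher, and the prompt text here is illustrative rather than the paper's actual prompt.

```python
# Dialogue prompting (sketch): a fixed prompt sets up a User/Gopher turn format,
# and the base LM is simply asked to continue it. GPT-2 stands in for Gopher.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Illustrative prompt (not the paper's actual prompt text).
dialogue_prompt = (
    "The following is a conversation between a user and Gopher, "
    "a knowledgeable and polite AI assistant.\n\n"
    "User: What is the tallest mountain on Earth?\n"
    "Gopher: Mount Everest, at roughly 8,849 metres above sea level.\n"
    "User: And the deepest point in the ocean?\n"
    "Gopher:"
)

completion = generator(dialogue_prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
# Keep only the assistant's next turn: text after the prompt, cut at the next "User:".
reply = completion[len(dialogue_prompt):].split("User:")[0].strip()
print(reply)
```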
The paper concludes by discussing the ethical and safe application of these models, including the need to mitigate undesirable behaviors before and after training.