Language Models are Few-Shot Learners


22 Jul 2020 | Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
This paper presents the results of training GPT-3, a 175 billion parameter autoregressive language model, and evaluating it in the zero-, one-, and few-shot settings, where tasks and demonstrations are specified purely through text prompts, without gradient updates or fine-tuning. The model is tested on a wide range of NLP tasks, including language modeling, cloze tasks, translation, question answering, and reasoning.

GPT-3 achieves strong few-shot performance on many of these datasets, as well as on tasks designed to test rapid adaptation or on-the-fly reasoning, such as unscrambling words, performing arithmetic, and using a novel word in a sentence after seeing it defined only once. The model can also generate synthetic news articles that human evaluators have difficulty distinguishing from articles written by humans, a finding whose broader societal implications the paper discusses. At the same time, the paper identifies datasets where GPT-3's few-shot learning still struggles, and others where evaluation faces methodological issues related to training on large web corpora.
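As a concrete illustration of the few-shot setting described above, the sketch below assembles a prompt from a task description, a handful of in-context demonstrations, and an unfinished query for the model to complete. The task, the examples, and the `generate` call are illustrative placeholders rather than the authors' actual evaluation harness; the paper specifies this format conceptually, not as code.

```python
# Minimal sketch of few-shot prompting: the task is conveyed entirely in text,
# with K worked examples followed by an unfinished query. No weights are updated.
# The examples and the `generate` callable below are illustrative placeholders.

def build_few_shot_prompt(task_description, examples, query):
    """Concatenate a task description, K demonstrations, and the query to complete."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    task_description="Translate English to French.",
    examples=[
        ("sea otter", "loutre de mer"),
        ("cheese", "fromage"),
    ],
    query="plush giraffe",
)

# `generate` stands in for a call to the language model; with two demonstrations
# this is a 2-shot prompt, and with an empty examples list it would be zero-shot.
# completion = generate(prompt, max_tokens=10)
print(prompt)
```

With more demonstrations the same construction yields the few-shot setting; the paper evaluates the same prompt format at zero, one, and many examples to compare the three regimes.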
The paper also undertakes a systematic study of "data contamination", a growing problem when training high-capacity models on datasets such as Common Crawl, which can inadvertently include content from test sets simply because that content exists on the web. The authors develop tools to measure contamination and quantify its distorting effects. They find that contamination has a minimal effect on GPT-3's performance on most datasets, but they identify a few datasets where it could inflate results; depending on severity, results on those datasets are either omitted or marked with an asterisk.

In addition, the authors train a series of smaller models, ranging from 125 million to 13 billion parameters, and compare them with GPT-3 in the zero-, one-, and few-shot settings. For most tasks, performance scales relatively smoothly with model capacity in all three settings; notably, the gap between zero-, one-, and few-shot performance often grows with model size, suggesting that larger models may be more proficient meta-learners. Finally, given the broad spectrum of capabilities GPT-3 displays, the paper discusses concerns about bias, fairness, and broader societal impacts, and offers a preliminary analysis of GPT-3's behavior in this regard.
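The contamination study rests on flagging benchmark examples whose text overlaps with the training corpus. The sketch below checks for shared word n-grams between a test example and training documents; the in-memory corpus, function names, and the fixed choice of N = 13 are assumptions for illustration, standing in for the paper's web-scale index and its more nuanced overlap rule.

```python
# Rough sketch of an n-gram overlap check for data contamination.
# A benchmark example is flagged as potentially contaminated if any of its
# word n-grams also appears somewhere in the training corpus. The corpus here
# is a tiny in-memory list; the real analysis would index a web-scale dataset.

def word_ngrams(text, n):
    """Yield lowercase word n-grams of length n from a piece of text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def build_training_index(training_docs, n=13):
    """Collect every n-gram seen in the training documents into a set."""
    index = set()
    for doc in training_docs:
        index.update(word_ngrams(doc, n))
    return index

def is_contaminated(example_text, training_index, n=13):
    """Return True if the example shares at least one n-gram with training data."""
    return any(gram in training_index for gram in word_ngrams(example_text, n))

# Toy usage: flag a benchmark example that overlaps the (tiny) training corpus.
train_docs = ["the quick brown fox jumps over the lazy dog near the quiet river bank today"]
index = build_training_index(train_docs, n=13)
print(is_contaminated(
    "the quick brown fox jumps over the lazy dog near the quiet river bank today as well",
    index, n=13))
```

Flagged examples can then be removed from, or reported separately for, the affected benchmarks, which is how the paper handles the datasets where contamination could inflate results.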
Understanding "Language Models are Few-Shot Learners"