31 Dec 2020 | Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
The paper introduces *the Pile*, an 825 GiB English text corpus designed for training large-scale language models. The Pile is constructed from 22 diverse, high-quality subsets, combining existing datasets with newly constructed ones drawn from sources such as academic and professional domains. Evaluating GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. However, models trained on the Pile significantly outperform models trained on both raw and filtered Common Crawl across all components of the Pile, and also improve performance on downstream evaluations.
The paper also documents potentially concerning aspects of the data, such as the topical distribution, pejorative content, bias, and sentiment co-occurrence, and provides a detailed analysis of the structural statistics of the dataset. The authors emphasize the importance of documenting the dataset to address ethical concerns and promote transparency in AI research. The Pile is made publicly available, along with the preprocessing code and documentation, to facilitate further research and development in the field of natural language processing.
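The per-component evaluation described above is conventionally reported as perplexity: the exponential of the mean negative log-likelihood the model assigns to each token, so lower is better. As a minimal sketch (not the paper's evaluation code, and with the per-token log-probabilities here being hypothetical placeholders rather than real model outputs):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: natural-log probabilities a language model assigned
    to each token in a passage.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical per-token probabilities for a three-token passage:
logprobs = [math.log(0.25), math.log(0.5), math.log(0.125)]
print(perplexity(logprobs))  # → 4.0 (inverse geometric mean of the probs)
```

A model that finds a component's text surprising (low token probabilities, as GPT-2/GPT-3 reportedly do on academic writing) yields a high perplexity on it, which is what "struggling on a component" quantifies.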