28 Feb 2024 | Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin
This paper provides a comprehensive survey of large language model (LLM) datasets, categorizing them into five types: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. The survey analyzes 444 datasets across 8 language categories and 32 domains, along 20 dimensions. The total data size exceeds 774.5 TB for pre-training corpora and 700M instances for the other dataset types. The paper discusses the challenges and future directions of LLM datasets, emphasizing the importance of high-quality data for LLM development. It also traces the evolution of text datasets from early NLP tasks to the current era of LLMs, noting the shift from purely task-centric construction to construction organized around both tasks and LLM training stages. The survey covers both general and domain-specific examples of each dataset type, providing insights into their characteristics, construction methods, and applications. It also discusses the preprocessing steps for pre-training data: data collection, filtering, deduplication, standardization, and review. The survey aims to give researchers a comprehensive understanding of LLM datasets, facilitating better development and application of LLMs.
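To make the dataset taxonomy concrete, here is a minimal sketch of record layouts commonly used for two of the surveyed types: instruction fine-tuning data and preference data. The field names follow widely used community conventions (Alpaca-style instruction triples, pairwise chosen/rejected preference records) and are illustrative assumptions, not formats prescribed by the survey itself.

```python
# Hypothetical record schemas for two surveyed dataset types.
# Field names follow common community conventions, not the survey.

# Instruction fine-tuning record: an instruction, optional input
# context, and the desired model output.
instruction_record = {
    "instruction": "Summarize the following article in one sentence.",
    "input": "Large language models are trained on web-scale corpora ...",
    "output": "LLMs learn language patterns from massive text collections.",
}

# Preference record: one prompt with a preferred and a dispreferred
# response, as used for reward modeling or direct preference tuning.
preference_record = {
    "prompt": "Explain overfitting to a beginner.",
    "chosen": "Overfitting is when a model memorizes its training data ...",
    "rejected": "Overfitting is good because the model fits the data well.",
}
```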
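The preprocessing steps listed above (collection, filtering, deduplication, standardization, review) can be sketched as a simple pipeline. The specific heuristics below (a length threshold, an alphabetic-character ratio, SHA-256 exact deduplication) are illustrative assumptions for this sketch, not the rules used by any particular corpus in the survey.

```python
# A minimal sketch of a pre-training preprocessing pipeline covering
# standardization, filtering, and exact deduplication. The thresholds
# and heuristics are illustrative assumptions only.
import hashlib
import unicodedata


def standardize(text: str) -> str:
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def passes_filters(text: str, min_chars: int = 200) -> bool:
    """Drop documents that are too short or mostly non-alphabetic."""
    if len(text) < min_chars:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio > 0.6


def preprocess(raw_documents):
    """Yield cleaned, filtered, exactly-deduplicated documents."""
    seen_hashes = set()
    for doc in raw_documents:
        doc = standardize(doc)
        if not passes_filters(doc):
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # skip exact duplicates
            continue
        seen_hashes.add(digest)
        yield doc
```

In practice, pipelines at corpus scale typically pair exact deduplication like this with near-duplicate detection (e.g., MinHash/LSH over shingles), and a human or model-based review pass follows as the final step.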