28 Feb 2024 | Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin
This paper provides a comprehensive survey of large language model (LLM) datasets, categorizing them into five types: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. The survey analyzes 444 datasets across 8 language categories and 32 domains, along 20 dimensions. The total data size exceeds 774.5 TB for pre-training corpora and 700M instances for the other dataset types. The paper discusses the challenges and future directions of LLM datasets, emphasizing the importance of high-quality data for LLM development. It also traces the evolution of text datasets from early NLP tasks to the current era of LLMs, noting the shift from purely task-centric construction to construction organized around both tasks and LLM training stages. The survey covers both general and domain-specific examples of each dataset type, providing insights into their characteristics, construction methods, and applications. It also discusses the preprocessing steps for pre-training data: data collection, filtering, deduplication, standardization, and review. The survey aims to give researchers a comprehensive understanding of LLM datasets, facilitating better development and application of LLMs.
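To make the dataset taxonomy concrete, here is a minimal sketch of record layouts commonly used for two of the surveyed types: instruction fine-tuning data and preference data. The field names follow widely used community conventions (Alpaca-style instruction triples, pairwise chosen/rejected preference records) and are illustrative assumptions, not formats prescribed by the survey itself.

```python
# Hypothetical record schemas for two surveyed dataset types.
# Field names follow common community conventions, not the survey.

# Instruction fine-tuning record: an instruction, optional input
# context, and the desired model output.
instruction_record = {
    "instruction": "Summarize the following article in one sentence.",
    "input": "Large language models are trained on web-scale corpora ...",
    "output": "LLMs learn language patterns from massive text collections.",
}

# Preference record: one prompt with a preferred and a dispreferred
# response, as used for reward modeling or direct preference tuning.
preference_record = {
    "prompt": "Explain overfitting to a beginner.",
    "chosen": "Overfitting is when a model memorizes its training data ...",
    "rejected": "Overfitting is good because the model fits the data well.",
}
```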
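The preprocessing steps listed above (collection, filtering, deduplication, standardization, review) can be sketched as a simple pipeline. The specific heuristics below (a length threshold, an alphabetic-character ratio, SHA-256 exact deduplication) are illustrative assumptions for this sketch, not the rules used by any particular corpus in the survey.

```python
# A minimal sketch of a pre-training preprocessing pipeline covering
# standardization, filtering, and exact deduplication. The thresholds
# and heuristics are illustrative assumptions only.
import hashlib
import unicodedata


def standardize(text: str) -> str:
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def passes_filters(text: str, min_chars: int = 200) -> bool:
    """Drop documents that are too short or mostly non-alphabetic."""
    if len(text) < min_chars:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio > 0.6


def preprocess(raw_documents):
    """Yield cleaned, filtered, exactly-deduplicated documents."""
    seen_hashes = set()
    for doc in raw_documents:
        doc = standardize(doc)
        if not passes_filters(doc):
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # skip exact duplicates
            continue
        seen_hashes.add(digest)
        yield doc
```

In practice, pipelines at corpus scale typically pair exact deduplication like this with near-duplicate detection (e.g., MinHash/LSH over shingles), and a human or model-based review pass follows as the final step.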