Understanding On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

This paper presents a comprehensive survey of the generation, curation, and evaluation of synthetic data driven by Large Language Models (LLMs). It addresses the challenge of generating high-quality synthetic data with pretrained LLMs, which can alleviate the limitations of real-world data. It highlights the importance of synthetic data generation in deep learning and outlines the current state of research, identifying key areas of focus and the gaps that remain. The paper proposes a generic workflow for synthetic data generation consisting of three stages: data generation, curation, and evaluation. For generation, it discusses methods including prompt engineering, multi-step generation, and in-context learning. For curation, it explores techniques such as high-quality sample filtering and label enhancement. For evaluation, it covers both direct and indirect methods. The paper also discusses future research directions, including complex task decomposition, knowledge enhancement, synergy between large and small LLMs, and human-model collaboration. It concludes by emphasizing the importance of synthetic data generation in data-centric AI and the need for further research in this area.
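
To make the three-stage workflow concrete, the following is a minimal sketch in Python, not the survey's own implementation. Every name in it (build_prompt, generate, curate, evaluate, toy_llm) is hypothetical. toy_llm is a canned stand-in for a real model call, and the filtering rules (a minimum word count and exact-duplicate removal) and the statistics reported are illustrative assumptions rather than methods taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def build_prompt(task: str, label: str, demos: List[Sample]) -> str:
    """In-context prompt: task description, a few demonstrations, target label."""
    lines = [f"Task: {task}"]
    for text, demo_label in demos:
        lines.append(f"Text: {text}\nLabel: {demo_label}")
    lines.append(f"Write one new text with label: {label}")
    return "\n".join(lines)


def generate(llm: Callable[[str, int], str], task: str, labels: List[str],
             demos: List[Sample], n_per_label: int) -> List[Sample]:
    """Generation stage: query the model n_per_label times for each label."""
    samples = []
    for label in labels:
        prompt = build_prompt(task, label, demos)
        for i in range(n_per_label):
            samples.append((llm(prompt, i), label))
    return samples


def curate(samples: List[Sample], min_words: int = 3) -> List[Sample]:
    """Curation stage (assumed rules): drop very short texts and exact duplicates."""
    seen = set()
    kept = []
    for text, label in samples:
        key = text.strip().lower()
        if len(key.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append((text, label))
    return kept


def evaluate(samples: List[Sample]) -> Dict[str, object]:
    """Evaluation stage: simple statistics computed on the data itself."""
    texts = [text.strip().lower() for text, _ in samples]
    return {
        "size": len(samples),
        "label_counts": dict(Counter(label for _, label in samples)),
        "distinct_ratio": len(set(texts)) / len(texts) if texts else 0.0,
    }


def toy_llm(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM call; returns canned text per label."""
    label = prompt.rsplit("label: ", 1)[1]
    canned = {
        "positive": ["I really enjoyed this film", "Great",
                     "I really enjoyed this film"],
        "negative": ["The plot was dull and slow", "Bad",
                     "A waste of two hours"],
    }
    return canned[label][seed % 3]


if __name__ == "__main__":
    demos = [("Loved it a lot", "positive")]
    raw = generate(toy_llm, "sentiment classification",
                   ["positive", "negative"], demos, n_per_label=3)
    print(evaluate(curate(raw)))
```

In practice, toy_llm would be replaced by a call to an actual model, and the statistics in evaluate would typically be complemented by training a downstream model on the curated data and measuring its task performance.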

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

14 Jun 2024 | Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang