14 Jun 2024 | Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang
This paper provides a comprehensive survey of the advancements in synthetic data generation driven by Large Language Models (LLMs). It highlights the challenges and opportunities in this field, aiming to guide both academic and industrial communities towards deeper and more methodical inquiries. The paper organizes relevant studies based on a generic workflow of synthetic data generation, including generation, curation, and evaluation. Key aspects such as prompt engineering, conditional prompting, multi-step generation, data curation, and evaluation methods are discussed in detail. The authors identify gaps in existing research and propose future directions, emphasizing the importance of complex task decomposition, knowledge enhancement, synergy between large and small LMs, and human-model collaboration. The paper concludes by outlining the potential benefits and ethical considerations of LLMs-driven synthetic data generation.