30 Jan 2024 | Debarati Das*, Karin de Langis*, Anna Martin-Boyle*, Jaehyung Kim*, Minhwa Lee*, Zae Myung Kim*, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Sachin Parkar, Ryan Koo, Jong Inn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang
This paper explores the expanding role of large language models (LLMs) in generating artificial data, including annotations, preferences, instruction prompts, simulated dialogues, and free text. The study aims to understand the quality and diversity of LLM-generated data and its impact on training cycles, highlighting the formation of an *artificial data ecosystem*. The research is the first to aggregate various types of LLM-generated text data, from constrained "task labels" to less constrained "free-form text," and to stress-test their quality and implications against human data. Although LLMs can match human performance on some benchmarks, significant disparities emerge, especially in complex tasks where LLMs often lack a nuanced understanding of human-generated content.
The study emphasizes the need for ethical practices in data creation and in the development of LLMs: because these models struggle to replicate human traits and behaviors, biases and artifacts surface in LLM-generated content. The findings underscore the importance of addressing these issues to ensure the development of AI systems that benefit society ethically and sustainably.