2024-06-14 | Jiuhai Chen, Rifaa Qadri, Yuxin Wen, Neel Jain, John Kirchenbauer, Tianyi Zhou, Tom Goldstein
GenQA is an instruction dataset written autonomously by an LLM, without human-authored examples. It is produced by generator prompts that elicit diverse instruction examples across many subjects, and it contains over 11 million questions split into nine parts, totaling approximately 2.8 billion words, a scale intended to reflect the number of instruction samples used to fine-tune Llama 3. A Llama-3 8B model fine-tuned on GenQA performs well on both knowledge-intensive and conversational benchmarks, matching or surpassing datasets created with extensive human labor.
GenQA's diversity comes from generator prompts that ask the model to first produce a long list of candidate choices and then select one at random, which yields far more varied questions and answers than a static prompt. Diversity is evaluated with nearest-neighbor similarity scores; generator-conditional and generator-nested prompts yield the highest diversity, and adding a randomness booster to the prompts improves it further.
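The generator-prompt idea described above can be sketched as follows. The exact prompt wording used in the paper is not reproduced here; the template below, including the step phrasing and the subject count, is purely illustrative of the "enumerate many choices, then pick one at random" pattern.

```python
import random

def build_generator_prompt(n_choices=50, seed=None):
    """Build a generator-style prompt: instead of a single static
    instruction, ask the model to enumerate many candidate subjects
    and answer for one index chosen at random by the caller."""
    rng = random.Random(seed)
    index = rng.randint(1, n_choices)  # random selection injects diversity
    return (
        f"Step 1: List {n_choices} diverse academic subjects.\n"
        f"Step 2: Take subject #{index} from your list.\n"
        "Step 3: Write one challenging question about that subject, "
        "followed by a detailed answer."
    )

prompt = build_generator_prompt(n_choices=50, seed=0)
print(prompt)
```

Because the chosen index differs per call, repeated sampling with the same template covers many subjects rather than collapsing onto the model's default answer for a fixed prompt.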
Compared against existing instruction datasets such as WizardLM and UltraChat, GenQA shows comparable diversity. Fine-tuning on GenQA achieves strong results on benchmarks including AlpacaEval and MT-Bench, and on targeted tasks such as math reasoning it outperforms the other datasets on several benchmarks.
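The nearest-neighbor similarity scores used for the diversity comparisons can be sketched with a toy metric. The paper's actual embedding model and scoring details are not specified here; this sketch uses random NumPy vectors as stand-ins for sentence embeddings and reports the mean cosine similarity between each example and its nearest neighbor, where lower means more diverse.

```python
import numpy as np

def mean_nn_similarity(embeddings):
    """Mean cosine similarity between each row and its nearest
    neighbor (self excluded); lower values indicate a more
    diverse set of examples."""
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(0)
tight = 1.0 + 0.01 * rng.normal(size=(100, 16))  # near-duplicate cluster
spread = rng.normal(size=(100, 16))              # well-spread points
assert mean_nn_similarity(tight) > mean_nn_similarity(spread)
```

The near-duplicate cluster scores close to 1.0 while the well-spread set scores much lower, which is the direction of the comparison the paper draws between prompt types and datasets.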
To build the dataset, generator prompts are written for each split and used to produce instruction examples, after which the splits are rebalanced to ensure equal representation in the final fine-tuning data. The resulting fine-tuned models are high quality, demonstrating that automated data generation can produce large, diverse instruction datasets. GenQA is thus valuable for research on industrial-scale fine-tuning practices, and the same approach can be used to create datasets for other domains.
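The rebalancing step admits a simple interpretation, sketched below. The paper's exact procedure is not detailed here; this sketch assumes rebalancing means downsampling every split to the size of the smallest one, and the split names are hypothetical.

```python
import random

def rebalance(splits, seed=0):
    """Downsample each split to the size of the smallest split so
    that every split contributes equally to the final mix."""
    rng = random.Random(seed)
    target = min(len(examples) for examples in splits.values())
    return {
        name: rng.sample(examples, target)
        for name, examples in splits.items()
    }

# Hypothetical splits of unequal size.
splits = {"math": list(range(500)), "dialog": list(range(200)), "code": list(range(350))}
balanced = rebalance(splits)
```

After rebalancing, every split has 200 examples, so no single subject dominates fine-tuning.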