2024-06-14 | Jiuhai Chen, Rifaa Qadri, Yuxin Wen, Neel Jain, John Kirchenbauer, Tianyi Zhou, Tom Goldstein
GenQA is an instruction dataset written autonomously by an LLM, without human-authored examples. It is produced by generator prompts that elicit diverse instruction examples across many subjects, and it contains over 11 million questions split into nine parts, totaling approximately 2.8 billion words, a scale intended to reflect the number of instruction samples used to fine-tune Llama 3. A Llama-3 8B model fine-tuned on GenQA performs well on both knowledge-intensive and conversational benchmarks, matching or surpassing datasets created with extensive human labor.
GenQA's diversity comes from generator prompts that ask the model to first produce a long list of candidate choices and then select one at random, which yields far more varied questions and answers than a static prompt. Diversity is evaluated with nearest-neighbor similarity scores; generator-conditional and generator-nested prompts yield the highest diversity, and adding a randomness booster to the prompts improves it further.
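The generator-prompt idea described above can be sketched as follows. The exact prompt wording used in the paper is not reproduced here; the template below, including the step phrasing and the subject count, is purely illustrative of the "enumerate many choices, then pick one at random" pattern.

```python
import random

def build_generator_prompt(n_choices=50, seed=None):
    """Build a generator-style prompt: instead of a single static
    instruction, ask the model to enumerate many candidate subjects
    and answer for one index chosen at random by the caller."""
    rng = random.Random(seed)
    index = rng.randint(1, n_choices)  # random selection injects diversity
    return (
        f"Step 1: List {n_choices} diverse academic subjects.\n"
        f"Step 2: Take subject #{index} from your list.\n"
        "Step 3: Write one challenging question about that subject, "
        "followed by a detailed answer."
    )

prompt = build_generator_prompt(n_choices=50, seed=0)
print(prompt)
```

Because the chosen index differs per call, repeated sampling with the same template covers many subjects rather than collapsing onto the model's default answer for a fixed prompt.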
Compared against existing instruction datasets such as WizardLM and UltraChat, GenQA shows comparable diversity. Fine-tuning on GenQA achieves strong results on benchmarks including AlpacaEval and MT-Bench, and on targeted tasks such as math reasoning it outperforms the other datasets on several benchmarks.
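The nearest-neighbor similarity scores used for the diversity comparisons can be sketched with a toy metric. The paper's actual embedding model and scoring details are not specified here; this sketch uses random NumPy vectors as stand-ins for sentence embeddings and reports the mean cosine similarity between each example and its nearest neighbor, where lower means more diverse.

```python
import numpy as np

def mean_nn_similarity(embeddings):
    """Mean cosine similarity between each row and its nearest
    neighbor (self excluded); lower values indicate a more
    diverse set of examples."""
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(0)
tight = 1.0 + 0.01 * rng.normal(size=(100, 16))  # near-duplicate cluster
spread = rng.normal(size=(100, 16))              # well-spread points
assert mean_nn_similarity(tight) > mean_nn_similarity(spread)
```

The near-duplicate cluster scores close to 1.0 while the well-spread set scores much lower, which is the direction of the comparison the paper draws between prompt types and datasets.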
To build the dataset, generator prompts are written for each split and used to produce instruction examples, after which the splits are rebalanced to ensure equal representation in the final fine-tuning data. The resulting fine-tuned models are high quality, demonstrating that automated data generation can produce large, diverse instruction datasets. GenQA is thus valuable for research on industrial-scale fine-tuning practices, and the same approach can be used to create datasets for other domains.
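The rebalancing step admits a simple interpretation, sketched below. The paper's exact procedure is not detailed here; this sketch assumes rebalancing means downsampling every split to the size of the smallest one, and the split names are hypothetical.

```python
import random

def rebalance(splits, seed=0):
    """Downsample each split to the size of the smallest split so
    that every split contributes equally to the final mix."""
    rng = random.Random(seed)
    target = min(len(examples) for examples in splits.values())
    return {
        name: rng.sample(examples, target)
        for name, examples in splits.items()
    }

# Hypothetical splits of unequal size.
splits = {"math": list(range(500)), "dialog": list(range(200)), "code": list(range(350))}
balanced = rebalance(splits)
```

After rebalancing, every split has 200 examples, so no single subject dominates fine-tuning.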