12 Jun 2024 | Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin
MAGPIE is a method for generating high-quality instruction data for fine-tuning large language models (LLMs) without human intervention or access to proprietary models. It exploits the auto-regressive nature of aligned LLMs such as Llama-3-Instruct: given only the pre-query template (the chat-format tokens that precede a user turn), the model self-synthesizes a user query, which can then be fed back to obtain a response. Applied at scale, this yields a large dataset of instruction-response pairs.

Using Llama-3-Instruct, MAGPIE generates 4 million instructions with corresponding responses and, after filtering, selects 300K high-quality instances. Models fine-tuned on this data outperform those trained on public instruction datasets such as ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, and Tulu-V2-Mix on alignment benchmarks including AlpacaEval, ArenaHard, and WildBench, and MAGPIE data also outperforms prior public datasets in preference optimization. The method is scalable and cost-effective, requiring no prompt engineering or seed questions.

The generated datasets, MAGPIE-Air and MAGPIE-Pro, are analyzed for coverage, task categories, quality, difficulty, similarity, and safety; the results show that MAGPIE produces high-quality, diverse, and safe instruction data. The data is effective for both instruction and preference tuning, and it also improves the performance of other models, such as Qwen1.5. The paper further discusses the method's limitations and ethical considerations, including the potential for harmful instructions and the need for further research on domain-specific instruction data.
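The core trick described above can be sketched in a few lines. This is a minimal, hedged illustration, not the authors' implementation: the special tokens are Llama-3-Instruct's real chat-format tokens, but `sample_completion` is a hypothetical stand-in for an actual LLM call (e.g. via vLLM or `transformers`' `generate()`), and the function names are ours.

```python
# Sketch of the MAGPIE pre-query trick: feed an aligned chat model only the
# template tokens that precede a user turn, and let it autocomplete the turn,
# thereby "self-synthesizing" a user instruction.

def build_pre_query_template(system_prompt=None):
    """Llama-3-Instruct chat format, truncated right after the user header,
    i.e. everything the model would see before a user turn's content."""
    t = "<|begin_of_text|>"
    if system_prompt is not None:
        t += ("<|start_header_id|>system<|end_header_id|>\n\n"
              f"{system_prompt}<|eot_id|>")
    t += "<|start_header_id|>user<|end_header_id|>\n\n"
    return t

def synthesize_pair(sample_completion):
    """Two sampling calls: (1) the model completes the open user turn,
    yielding a self-synthesized instruction; (2) the completed turn is fed
    back with an assistant header to obtain the response.

    `sample_completion(prompt, stop)` is a hypothetical callable wrapping
    whatever inference backend is used.
    """
    pre_query = build_pre_query_template()
    instruction = sample_completion(pre_query, stop="<|eot_id|>")
    full_prompt = (pre_query + instruction
                   + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
    response = sample_completion(full_prompt, stop="<|eot_id|>")
    return instruction, response
```

Because the prompt contains no seed question or task description, each sampled completion is a fresh instruction drawn from the model's own alignment distribution, which is why the method needs no prompt engineering.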