AgentInstruct: Toward Generative Teaching with Agentic Flows


3 Jul 2024 | Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah
Microsoft Research

**Abstract:** Synthetic data is crucial for accelerating the development of language models, but it often requires significant human effort to curate. This paper introduces AgentInstruct, an agentic framework for creating diverse and high-quality synthetic data for post-training. AgentInstruct uses raw data sources such as text documents and code files as seeds to generate both prompts and responses. The framework is demonstrated by creating a dataset of 25 million prompt-response pairs that teaches language models a range of skills, including text editing, creative writing, tool usage, coding, and reading comprehension. Fine-tuning Mistral-7b on this dataset yields significant improvements over other instruction-tuned models: 40% on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. AgentInstruct can also enable synthetic data generation as a service, facilitating continuous learning and improvement of any base LLM.

**Introduction:** Synthetic data has significantly accelerated the development of Large Language Models (LLMs), but generating high-quality synthetic data remains challenging. AgentInstruct addresses this by creating diverse and high-quality synthetic data for post-training, with a focus on teaching new skills to AI models. The framework uses powerful models such as GPT-4, together with tools like search APIs and code interpreters, to generate data. It can produce data at scale while ensuring diversity and complexity through iterative refinement and automation.

**Generative Teaching: AgentInstruct:** AgentInstruct defines a structured approach to generating synthetic data for a variety of skills, including reading comprehension, text modification, and tool use. The framework consists of three flows: Content Transformation, Seed Instruction Generation, and Instruction Refinement. Each flow uses multiple agents to transform raw seeds into high-quality, diverse prompts and responses. The framework is evaluated on multiple benchmarks, showing significant improvements over baseline models.
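To make the three-flow design concrete, here is a minimal sketch of how such a pipeline could be wired together. This is not the paper's implementation: the function names, prompt templates, instruction types, and the simple text-in/text-out LLM interface are hypothetical placeholders, and the real framework orchestrates many specialized agents within each flow.

```python
# Hypothetical sketch of an AgentInstruct-style data-generation pipeline.
# All names and prompts below are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model endpoint

@dataclass
class Pair:
    prompt: str
    response: str

def content_transformation(llm: LLM, seed: str) -> str:
    """Flow 1: turn a raw seed (document or code file) into an intermediate
    representation better suited for instruction creation."""
    return llm("Rewrite the following text as a structured passage suitable "
               "for generating questions:\n\n" + seed)

def seed_instruction_generation(llm: LLM, content: str) -> List[str]:
    """Flow 2: several agents (here, repeated calls with different instruction
    types) produce diverse seed instructions from the transformed content."""
    instruction_types = ["literal comprehension", "inference", "critical reasoning"]
    return [llm(f"Write a {t} task about this passage:\n\n{content}")
            for t in instruction_types]

def instruction_refinement(llm: LLM, instruction: str, rounds: int = 2) -> str:
    """Flow 3: a suggester/editor loop that iteratively increases the
    complexity and quality of each instruction."""
    for _ in range(rounds):
        suggestion = llm("Suggest one way to make this task harder:\n" + instruction)
        instruction = llm("Rewrite the task applying this suggestion.\n"
                          f"Task: {instruction}\nSuggestion: {suggestion}")
    return instruction

def agentinstruct_pipeline(llm: LLM, seed: str) -> List[Pair]:
    """Chain the three flows and answer each refined prompt to form pairs."""
    content = content_transformation(llm, seed)
    pairs = []
    for inst in seed_instruction_generation(llm, content):
        refined = instruction_refinement(llm, inst)
        pairs.append(Pair(prompt=refined, response=llm(refined)))
    return pairs
```

Run over a large corpus of raw seeds, a pipeline of this shape is what lets the approach scale to millions of diverse prompt-response pairs without manual curation of individual examples.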
**Evaluation:** The fine-tuned Mistral-7b model (Orca-3) is compared against other models on a range of benchmarks, demonstrating substantial improvements in reading comprehension, math, format following, abstractive summarization, and RAG capabilities. Orca-3 outperforms models such as LLAMA-8B-instruct and GPT-3.5-turbo, with improvements of 40% on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval.