AgentInstruct: Toward Generative Teaching with Agentic Flows

3 Jul 2024 | Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousogos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah
AgentInstruct is an agentic framework for generating high-quality, diverse synthetic data for post-training of language models. It creates both prompts and responses using raw data sources, such as text documents and code files, as seeds. The framework uses agentic flows with multiple agents, tools, and reflection to generate data across a wide range of skills, including text editing, creative writing, coding, and more. A rough sketch of this recipe appears below.

Using the framework, a dataset of 25 million prompt-response pairs was created to teach language models a broad set of skills. Mistral-7B was post-trained on this data to produce Orca-3, which significantly outperformed other models on multiple benchmarks, including AGIEval, MMLU, GSM8K, BBH, and AlpacaEval, and showed a 31.34% reduction in hallucination across summarization benchmarks. Evaluations spanned reading comprehension, math, format following, abstractive summarization, and RAG, with significant improvements across all tasks.

The framework enables synthetic data generation as a service, allowing continual learning and improvement of any base LLM, and it is also effective for self-improvement of larger models because it can generate new prompts along with high-quality responses. Despite its effectiveness, AgentInstruct has limitations: constructing agentic flows requires human effort, synthetic data can contain inaccuracies, generation is resource-intensive, seed data may introduce bias, and validating synthetic data remains challenging.
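The summary above does not include reference code, but the general recipe it describes (raw seed data passed through several specialized agents to produce prompt-response pairs) can be sketched as follows. This is a minimal, illustrative Python sketch: the names (`Agent`, `agentinstruct_flow`, the role prompts) and the `complete` callable are assumptions for exposition, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Placeholder for any text-completion call (e.g., a chat API client wrapped
# in a function). This is an assumption for the sketch, not part of AgentInstruct.
Complete = Callable[[str], str]

@dataclass
class Agent:
    """One role in an agentic flow: a role prompt plus an LLM call."""
    role_prompt: str
    complete: Complete

    def run(self, text: str) -> str:
        # Prepend the role instruction to the working text and call the model.
        return self.complete(f"{self.role_prompt}\n\n{text}")

def agentinstruct_flow(seed_text: str, complete: Complete,
                       n_instructions: int = 3) -> List[Tuple[str, str]]:
    """Hypothetical flow: transform a raw seed, generate candidate tasks,
    refine each task (a simple form of reflection), then generate responses,
    yielding (prompt, response) pairs for post-training."""
    transformer = Agent(
        "Rewrite the following raw text into an intermediate form suitable "
        "for creating instruction-following tasks.", complete)
    generator = Agent(
        f"Create {n_instructions} diverse, self-contained task instructions "
        "grounded in the passage below. Output one instruction per line.", complete)
    refiner = Agent(
        "Make the following task harder or more constrained while keeping it "
        "answerable from the passage. Return only the revised task.", complete)
    solver = Agent(
        "Solve the following task carefully and completely.", complete)

    transformed = transformer.run(seed_text)
    instructions = [line.strip()
                    for line in generator.run(transformed).splitlines()
                    if line.strip()]

    pairs: List[Tuple[str, str]] = []
    for instruction in instructions[:n_instructions]:
        refined = refiner.run(f"Passage:\n{transformed}\n\nTask:\n{instruction}")
        prompt = f"{transformed}\n\n{refined}"
        pairs.append((prompt, solver.run(prompt)))
    return pairs
```

In practice, `complete` would wrap a capable teacher model, and each stage could fan out over many agent variants and tools to increase the diversity of the generated data.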