MAmmoTH2: Scaling Instructions from the Web


23 May 2024 | Xiang Yue*, Tuney Zheng*, Ge Zhang*, Wenhu Chen*
This paper proposes a method to efficiently harvest 10 million naturally existing instruction-response pairs from the pre-training web corpus to enhance the reasoning ability of large language models (LLMs). The approach involves three steps: (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs with open-source LLMs. The resulting dataset, WEBINSTRUCT, is mined from the web without any human crowdsourcing or GPT-4 distillation.

Training base LLMs on this dataset yields the MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B improves from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction-tuning datasets produces MAmmoTH2-Plus, which achieves state-of-the-art results across reasoning, code generation, and chatbot benchmarks. This work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building instruction-tuning data, and highlights both the effectiveness of the three-step pipeline and the potential of web-mined instruction data to improve LLM reasoning capabilities.
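To make the three-step pipeline concrete, the sketch below shows how a recall, extract, refine loop could be wired together. It is a minimal illustration under stated assumptions: the keyword-based recall scorer, the prompts, and the `call_llm(prompt) -> str` helper are placeholders, not the paper's actual recall classifier or extraction/refinement prompts.

```python
# Illustrative sketch of a recall -> extract -> refine mining loop.
# The cue-word scorer, prompts, and `call_llm` interface are assumptions,
# not the WEBINSTRUCT implementation.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class QAPair:
    question: str
    answer: str

# Step 1: recall -- keep documents that look instruction-like.
# (The paper trains a recall model on seed documents; a cue-word count stands in here.)
def recall(documents: Iterable[str], threshold: int = 2) -> list[str]:
    cues = ("question", "answer", "solution", "explain", "solve", "prove")
    return [d for d in documents if sum(c in d.lower() for c in cues) >= threshold]

# Step 2: extraction -- ask an open-source LLM to pull Q/A pairs out of a document.
# `call_llm(prompt) -> str` is an assumed chat-completion helper.
def extract_pairs(doc: str, call_llm: Callable[[str], str]) -> list[QAPair]:
    prompt = (
        "Extract every question-and-answer pair from the document below.\n"
        "Format each pair as:\nQ: <question>\nA: <answer>\n\nDocument:\n" + doc
    )
    pairs, question = [], None
    for line in call_llm(prompt).splitlines():
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question:
            pairs.append(QAPair(question, line[2:].strip()))
            question = None
    return pairs

# Step 3: refinement -- rewrite noisy extracted answers into clean, step-by-step solutions.
def refine(pair: QAPair, call_llm: Callable[[str], str]) -> QAPair:
    prompt = (
        "Rewrite the answer as a clean, step-by-step solution, staying faithful "
        f"to the original.\nQuestion: {pair.question}\nAnswer: {pair.answer}"
    )
    return QAPair(pair.question, call_llm(prompt).strip())

def mine_instructions(corpus: Iterable[str], call_llm: Callable[[str], str]) -> list[QAPair]:
    return [refine(p, call_llm) for doc in recall(corpus) for p in extract_pairs(doc, call_llm)]
```

In this framing, the expensive LLM calls happen only on recalled documents, which is what makes mining from a web-scale corpus tractable; the recall step is the filter that keeps the extraction and refinement stages affordable.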