MAmmoTH2: Scaling Instructions from the Web

MAMmoTH2: Scaling Instructions from the Web

23 May 2024 | Xiang Yue*, Tuney Zheng*, Ge Zhang*, Wenhu Chen*
The paper "MAmmoTH2: Scaling Instructions from the Web" by Xiang Yue, Tuney Zheng, Ge Zhang, and Wenhu Chen from Carnegie Mellon University and the University of Waterloo proposes a novel approach to enhance the reasoning abilities of large language models (LLMs) by harvesting 10 million naturally existing instruction-response pairs from web pre-training corpora. The authors develop a three-step pipeline: recall, extract, and refine. They use a fastText model to recall relevant documents from the Common Crawl corpus, extract Q&A pairs using open-source LLMs like Mixtral, and refine these pairs to improve quality. The resulting dataset, WEBINSTRUCT, is used to fine-tune LLMs, leading to significant improvements in reasoning benchmarks such as MATH and GSM8K. Further fine-tuning on additional public datasets yields MAmmoTH2-Plus, which achieves state-of-the-art performance on multiple benchmarks. The study demonstrates the effectiveness of harvesting large-scale, high-quality instruction data from the web without costly human annotation or GPT-4 distillation, providing a new paradigm for instruction tuning.The paper "MAmmoTH2: Scaling Instructions from the Web" by Xiang Yue, Tuney Zheng, Ge Zhang, and Wenhu Chen from Carnegie Mellon University and the University of Waterloo proposes a novel approach to enhance the reasoning abilities of large language models (LLMs) by harvesting 10 million naturally existing instruction-response pairs from web pre-training corpora. The authors develop a three-step pipeline: recall, extract, and refine. They use a fastText model to recall relevant documents from the Common Crawl corpus, extract Q&A pairs using open-source LLMs like Mixtral, and refine these pairs to improve quality. The resulting dataset, WEBINSTRUCT, is used to fine-tune LLMs, leading to significant improvements in reasoning benchmarks such as MATH and GSM8K. Further fine-tuning on additional public datasets yields MAmmoTH2-Plus, which achieves state-of-the-art performance on multiple benchmarks. The study demonstrates the effectiveness of harvesting large-scale, high-quality instruction data from the web without costly human annotation or GPT-4 distillation, providing a new paradigm for instruction tuning.