15 Feb 2024 | Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman
OpenMathInstruct-1 is a large math instruction tuning dataset containing 1.8 million problem-solution pairs. It was built with the permissively licensed Mixtral model, which was prompted to generate code-interpreter style solutions for the training sets of the GSM8K and MATH benchmarks.

The dataset was synthesized by sampling multiple candidate solutions per problem and keeping those that reach the ground-truth answer, with the goal of maximizing training set coverage. The resulting dataset covers 99.9% of GSM8K and 93% of MATH training problems with at least one correct solution. Diversity and quality were improved through a combination of prompting strategies, including subject-specific prompts and few-shot prompts built from masked text solutions, and the generations were post-processed to remove syntactically noisy solutions and to keep the representation of problems balanced.

Models fine-tuned on the dataset, including OpenMath-CodeLlama-70B, achieve scores on the GSM8K and MATH benchmarks on par with the best gpt-distilled models. The dataset itself is substantially larger than existing open math instruction tuning datasets while matching them in quality, and it is publicly available under a commercially permissive license that allows unrestricted use. A large pool of incorrect sampled solutions is released alongside it to support further open-source work in this area.
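To make the sample-then-filter recipe concrete, here is a minimal sketch of the synthesis loop described above. The names `sample_solutions` and `is_correct`, and the sampling budget `n_samples=32`, are illustrative assumptions rather than the paper's actual pipeline code.

```python
def synthesize_dataset(problems, sample_solutions, is_correct, n_samples=32):
    """Sample-then-filter synthesis sketch: draw many candidate solutions
    per problem and keep those whose final answer matches the ground truth.
    `sample_solutions` (the Mixtral generation call) and `is_correct`
    (the answer-equivalence check) are hypothetical stand-ins."""
    dataset, covered = [], 0
    for prob in problems:
        candidates = sample_solutions(prob["question"], n=n_samples)
        correct = [s for s in candidates if is_correct(s, prob["answer"])]
        if correct:
            covered += 1  # at least one correct solution found for this problem
        dataset.extend(
            {"question": prob["question"], "solution": s} for s in correct
        )
    # Coverage = fraction of training problems with >= 1 correct solution,
    # the metric behind the 99.9% (GSM8K) and 93% (MATH) figures.
    coverage = covered / len(problems)
    return dataset, coverage
```

In this framing, candidates that fail the answer check are exactly the "incorrect sampled solutions" the authors release alongside the main dataset.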
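The post-processing step can be sketched the same way: a pass that drops solutions whose code does not parse, removes exact duplicates, and caps how many solutions any single problem contributes. The `extract_code` helper and the `max_per_problem=16` cap below are hypothetical choices for illustration.

```python
from collections import defaultdict

def parses(code_str):
    """Crude noise filter: the code part of a solution should at least
    parse. (Assumes `code_str` is pure Python; real code-interpreter
    solutions interleave text and code.)"""
    try:
        compile(code_str, "<solution>", "exec")
        return True
    except SyntaxError:
        return False

def postprocess(dataset, extract_code, max_per_problem=16):
    """Hypothetical cleanup pass: drop syntactically noisy solutions,
    remove exact duplicates, and cap solutions per problem so problems
    with many correct samples do not dominate the training mix."""
    by_problem = defaultdict(list)
    for ex in dataset:
        by_problem[ex["question"]].append(ex["solution"])
    cleaned = []
    for question, solutions in by_problem.items():
        seen, kept = set(), []
        for s in solutions:
            if s in seen or not parses(extract_code(s)):
                continue
            seen.add(s)
            kept.append(s)
        cleaned.extend(
            {"question": question, "solution": s} for s in kept[:max_per_problem]
        )
    return cleaned
```

The per-problem cap is one simple way to realize the "balanced representation of problems" the summary mentions; the paper's actual balancing scheme may differ.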