DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

18 Jun 2024 | Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He
This paper introduces DART-Math, a method for improving the mathematical problem-solving abilities of large language models (LLMs) through difficulty-aware rejection tuning. A key obstacle in mathematical reasoning is that existing synthetic datasets are biased toward easy queries, which leads to poor performance on difficult problems. DART-Math addresses this by generating synthetic datasets that devote more samples to difficult queries, using a 7B-sized open-weight model rather than proprietary models such as GPT-4.

The method employs two sampling strategies: Uniform, which collects the same number of correct responses for every query, and Prop2Diff, which biases data collection toward difficult queries. These strategies yield synthetic datasets of around 590K examples, which are used to fine-tune various base models into a series of strong mathematical models called DART-Math.

DART-Math outperforms vanilla rejection tuning and previous state-of-the-art models on six mathematical benchmarks, despite using smaller datasets and no proprietary models. Evaluated on both in-domain and out-of-domain benchmarks, it shows significant improvements on challenging tasks, demonstrating that difficulty-aware rejection sampling is effective for enhancing mathematical reasoning. The authors present the synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving, and the approach is open-sourced, making the datasets and models available for further research and development.
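To make the two strategies concrete, here is a minimal sketch of how a response-collection budget might be split across queries. It assumes difficulty is approximated by each query's observed failure rate (the fraction of sampled responses that were wrong); the paper's exact difficulty metric and budgeting may differ, and `allocate_targets` is a hypothetical helper, not the authors' code.

```python
def allocate_targets(fail_rates, total_budget, strategy="uniform"):
    """Split a response-collection budget across queries.

    fail_rates:   per-query failure rate in [0, 1], used here as a
                  simple proxy for difficulty (an assumption).
    total_budget: total number of correct responses to collect.
    Returns a list of per-query target counts.
    """
    n = len(fail_rates)
    if strategy == "uniform":
        # Uniform: target the same number of correct responses per query.
        return [total_budget // n] * n
    if strategy == "prop2diff":
        # Prop2Diff: bias the budget toward harder (higher fail-rate) queries.
        total_difficulty = sum(fail_rates) or 1.0
        return [round(total_budget * f / total_difficulty) for f in fail_rates]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy example: three queries of increasing difficulty, budget of 100.
uniform = allocate_targets([0.1, 0.3, 0.6], 100, "uniform")      # [33, 33, 33]
prop2diff = allocate_targets([0.1, 0.3, 0.6], 100, "prop2diff")  # [10, 30, 60]
```

Under Prop2Diff the hardest query receives six times the budget of the easiest one, whereas Uniform treats all queries alike; this is the core difference the summary describes.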