**DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving**
**Authors:** Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He
**Institution:** Tsinghua University, Helixon Research, HKUST
**Abstract:**
Solving mathematical problems requires advanced reasoning abilities and presents significant challenges for large language models (LLMs). Previous works typically synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, these datasets are often biased towards easy queries, with frequent failures to generate correct responses for the most challenging queries. To address this issue, we propose *Difficulty-Aware Rejection Tuning* (DART), a method that allocates more trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples. Using DART, we create new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Notably, our synthesis process relies solely on a 7B-sized open-weight model, without using proprietary models like GPT-4. We fine-tune various base models ranging from 7B to 70B in size, resulting in a series of strong models called DART-Math. In comprehensive evaluations on six mathematical benchmarks, DART-Math significantly outperforms vanilla rejection tuning, achieving results superior or comparable to previous methods despite using much smaller datasets and no proprietary models. Our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.
**Key Contributions:**
1. **DART (Difficulty-Aware Rejection Tuning):** A method that allocates more trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples.
2. **New Datasets:** Create new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones.
3. **Open-Weight Model:** Utilize a 7B-sized open-weight model, DeepSeekMath-7B-RL, for data synthesis, eliminating the reliance on proprietary models.
4. **Strong Models:** Create a series of strong models called DART-Math, achieving superior or comparable performance to previous best models on challenging benchmarks.
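The core mechanism behind contribution 1 can be sketched in a few lines. This is a minimal illustration of difficulty-aware rejection sampling, not the paper's exact algorithm: the helper names (`dars_uniform`, `generate`, `is_correct`) and the per-query budget parameters are hypothetical, and the assumed strategy is that each query keeps being sampled until it has a target number of verified-correct responses, so difficult queries naturally consume more trials.

```python
def dars_uniform(queries, generate, is_correct, target=2, max_trials=16):
    """Sketch of difficulty-aware rejection sampling (uniform variant).

    Each query is sampled until it has `target` correct responses or the
    `max_trials` budget is exhausted. Easy queries finish in few trials;
    hard queries receive proportionally more sampling effort.
    """
    dataset = []
    for q in queries:
        kept, trials = [], 0
        while len(kept) < target and trials < max_trials:
            resp = generate(q)   # sample one response from the synthesis model
            trials += 1
            if is_correct(q, resp):  # answer-checking against the reference
                kept.append(resp)
        dataset.extend((q, r) for r in kept)
    return dataset
```

By contrast, vanilla rejection sampling draws a fixed number of responses per query and keeps the correct ones, which leaves the hardest queries with few or no training samples.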
**Experiments:**
- **Setup:** Use DeepSeekMath-7B-RL to synthesize responses, perform standard instruction tuning on synthetic datasets, and evaluate on various benchmarks.
- **Results:** DART-Math outperforms vanilla rejection tuning and baselines on most benchmarks, demonstrating the effectiveness of difficulty-aware rejection sampling.
- **Analysis:** Study scaling behaviors, effect of one-response coverage, and synthesis cost, highlighting the benefits and limitations of DART.
**Discussion:**
- **Limitations:** The method uses fail rate as the difficulty metric, which may not be optimal. DART is also limited to natural-language reasoning, while code generation could potentially improve performance.
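The fail-rate difficulty metric mentioned in the limitations can be estimated empirically. The sketch below is an assumption about how such a proxy would be computed (the function name and sample count are illustrative, not from the paper): draw a fixed number of responses per query and measure the fraction that are wrong.

```python
def estimate_fail_rate(query, generate, is_correct, n_samples=8):
    """Empirical fail rate as a difficulty proxy: the fraction of
    sampled responses judged incorrect. Higher values indicate a
    harder query and would warrant more synthesis trials."""
    wrong = sum(1 for _ in range(n_samples)
                if not is_correct(query, generate(query)))
    return wrong / n_samples
```

A coarser estimator (small `n_samples`) is cheaper but noisier, which is one reason fail rate may be a suboptimal difficulty signal.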