The paper introduces EURUS, a suite of large language models (LLMs) optimized for reasoning. Fine-tuned from Mistral-7B and CodeLlama-70B, the EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning. Notably, EURUS-70B outperforms GPT-3.5 Turbo in reasoning across 12 tests, achieving 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, substantially surpassing existing open-source models.
The key to EURUS's success is ULTRAINTERACT, a large-scale, high-quality alignment dataset designed for complex reasoning tasks. ULTRAINTERACT pairs each instruction with a preference tree that comprises reasoning chains built with diverse planning strategies, multi-turn interaction trajectories, and pairwise correct/incorrect actions that facilitate preference learning. The dataset is used for both supervised fine-tuning and preference learning, strengthening the models' reasoning capabilities.
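To make the preference-tree structure concrete, the following is a minimal, hypothetical sketch of how a single tree node could be represented in Python; the class and field names are illustrative assumptions, not the actual ULTRAINTERACT schema:

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


# Hypothetical representation of one node in an ULTRAINTERACT-style preference tree.
# Names and fields are illustrative only; the real dataset schema may differ.
@dataclass
class PreferenceTreeNode:
    instruction: str                    # task description or follow-up observation at this turn
    chosen_action: str                  # a correct (preferred) response for this turn
    rejected_actions: List[str] = field(default_factory=list)           # incorrect responses paired against it
    children: List["PreferenceTreeNode"] = field(default_factory=list)  # later turns expanded from this node


def preference_pairs(node: PreferenceTreeNode) -> Iterator[Tuple[str, str]]:
    """Yield (chosen, rejected) pairs from the whole tree for preference learning."""
    for rejected in node.rejected_actions:
        yield node.chosen_action, rejected
    for child in node.children:
        yield from preference_pairs(child)
```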
The paper also explores different preference learning algorithms and finds that some algorithms well established for general conversation may be less suitable for reasoning tasks. Informed by this analysis, the authors derive a novel reward modeling objective and train a strong reward model, EURUS-RM-7B, which correlates better with human annotators than existing models on AutoJ and MT-Bench.
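As a rough illustration of the reward modeling idea, the objective can be viewed as a Bradley-Terry ranking term augmented with terms that push the absolute reward of chosen responses up and of rejected responses down. The PyTorch snippet below is a hedged sketch of one such formulation and is not claimed to be the paper's exact loss:

```python
import torch
import torch.nn.functional as F


def reward_modeling_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Sketch of a Bradley-Terry loss augmented with absolute-reward terms.

    r_chosen / r_rejected are the scalar rewards assigned to the preferred and
    dispreferred responses of each pair (shape: [batch]). The exact form and
    weighting used for EURUS-RM-7B may differ; this is an assumption for illustration.
    """
    l_bt = -F.logsigmoid(r_chosen - r_rejected).mean()                    # ranking term
    l_dr = -(F.logsigmoid(r_chosen) + F.logsigmoid(-r_rejected)).mean()   # absolute-reward terms
    return l_bt + l_dr
```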
Overall, EURUS and ULTRAINTERACT push the boundaries of open-source reasoning generalists, demonstrating superior performance on challenging benchmarks and providing insights into preference learning for reasoning tasks.