The paper introduces EURUS, a suite of large language models (LLMs) optimized for reasoning that achieves state-of-the-art results across a diverse set of benchmarks spanning mathematics, code generation, and logical reasoning. The models are fine-tuned from Mistral-7B and CodeLlama-70B on ULTRAINTERACT, a newly curated large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. ULTRAINTERACT provides a preference tree for each instruction, consisting of reasoning chains with diverse planning strategies, multi-turn interaction trajectories with the environment and with critiques, and pairwise data to facilitate preference learning. The dataset enables in-depth exploration of preference learning for reasoning tasks and yields a strong reward model, EURUS-RM-7B, which correlates better with human annotators than existing reward models. EURUS-70B outperforms GPT-3.5 Turbo in reasoning, achieving 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA.

The paper also examines preference learning algorithms and finds that some well-established algorithms may be less suitable for reasoning tasks than for general conversation. It therefore introduces a novel reward modeling objective that improves the reward model's performance. EURUS-RM-7B demonstrates strong preference modeling performance on reasoning tasks and improves LLMs' reasoning capabilities when used for reranking. The paper concludes that ULTRAINTERACT is a valuable ingredient in the training data mixture for reward modeling, and that reasoning performance correlates with the reward values assigned to chosen data.
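To make the reward modeling idea concrete, below is a minimal PyTorch-style sketch of a pairwise objective of the kind described: a standard Bradley-Terry comparison term augmented with terms that push the absolute reward of chosen responses up and of rejected responses down, reflecting the observation that for reasoning the correctness of the chosen response matters and not only the margin. This is a sketch under those assumptions, not the authors' implementation; the exact objective, architecture, and hyperparameters for EURUS-RM-7B are defined in the paper, and names such as `ultrainteract_rm_loss` are illustrative.

```python
import torch
import torch.nn.functional as F

def ultrainteract_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss (illustrative sketch).

    r_chosen / r_rejected: scalar rewards assigned to the chosen and
    rejected response of each preference pair, shape (batch,).
    """
    # Bradley-Terry term: maximize log-sigmoid of the chosen-minus-rejected margin.
    l_bt = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Assumed direct-reward terms: push chosen rewards above zero and
    # rejected rewards below zero, not just apart from each other.
    l_dr = -(F.logsigmoid(r_chosen) + F.logsigmoid(-r_rejected)).mean()
    return l_bt + l_dr


# Toy usage with rewards from some scalar-head reward model (hypothetical values).
r_c = torch.tensor([1.2, 0.3, 0.8])    # rewards for chosen responses
r_r = torch.tensor([-0.5, 0.1, -1.0])  # rewards for rejected responses
print(ultrainteract_rm_loss(r_c, r_r))

# Reranking sketch: among n candidate solutions to one problem, keep the
# candidate the reward model scores highest (best-of-n reranking).
candidate_rewards = torch.tensor([0.4, 2.1, -0.3, 1.7])
best_candidate = int(candidate_rewards.argmax())
```

The same scalar rewards serve both roles summarized above: during training they feed the pairwise loss, and at inference time they rank sampled solutions so that the highest-scoring candidate is returned.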