Advancing LLM Reasoning Generalists with Preference Trees

2 Apr 2024 | Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
The paper introduces EURUS, a suite of large language models (LLMs) optimized for reasoning that achieves state-of-the-art results across a diverse set of benchmarks spanning mathematics, code generation, and logical reasoning. The EURUS models are fine-tuned from Mistral-7B and CodeLlama-70B and are trained on ULTRAINTERACT, a newly curated large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. ULTRAINTERACT pairs each instruction with a preference tree consisting of reasoning chains with diverse planning strategies, multi-turn interaction trajectories with the environment and with critiques, and pairwise data to facilitate preference learning.

The dataset enables in-depth exploration of preference learning for reasoning tasks and yields a strong reward model, EURUS-RM-7B, which correlates better with human annotators than existing models. EURUS-70B outperforms GPT-3.5 Turbo on reasoning, achieving 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA.

The paper also explores preference learning algorithms, finding that some well-established algorithms are less suitable for reasoning tasks than for general conversation, and it introduces a novel reward modeling objective that further improves the reward model. EURUS-RM-7B demonstrates strong preference modeling performance on reasoning tasks and improves LLMs' reasoning capabilities through reranking. The paper concludes that ULTRAINTERACT is a valuable ingredient in reward-modeling training data mixtures and that reasoning performance correlates with the reward values assigned to chosen data.
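To make the pairwise preference learning setup concrete, the sketch below shows a standard Bradley-Terry reward-modeling loss over (chosen, rejected) pairs such as those drawn from ULTRAINTERACT's preference trees. This is an illustrative baseline only, not the paper's objective: the paper's novel objective augments this kind of pairwise formulation, and its exact form, along with the function and variable names below, are assumptions for illustration.

```python
# Illustrative sketch only: a standard Bradley-Terry pairwise reward-modeling
# loss on (chosen, rejected) pairs, as a baseline for the kind of preference
# learning ULTRAINTERACT supports. The paper's actual reward modeling
# objective differs and is not reproduced here.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward-model outputs for a batch of three preference pairs.
r_chosen = torch.tensor([1.2, 0.7, 2.1])
r_rejected = torch.tensor([0.3, 0.9, -0.5])
loss = pairwise_reward_loss(r_chosen, r_rejected)
print(loss.item())
```

An objective of this family constrains only the gap between chosen and rejected rewards, not their absolute values; the paper's observation that reasoning performance correlates with the reward values of chosen data may be one motivation for going beyond it.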