Iterative Reasoning Preference Optimization

26 Jun 2024 | Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
This paper introduces an iterative approach to optimizing preferences between competing Chain-of-Thought (CoT) candidates, with the goal of improving performance on reasoning tasks. The method, called Iterative Reasoning Preference Optimization (Iterative RPO), trains the model with a modified DPO loss that includes an additional negative log-likelihood (NLL) term. In each iteration, the model generates multiple CoT reasoning steps and answers for the training prompts, preference pairs are constructed based on the correctness of the final answers, and the model is trained on these pairs. Repeating this process improves reasoning performance over successive iterations. The approach outperforms baselines including zero-shot CoT, supervised fine-tuning (SFT), and standard DPO on datasets such as GSM8K, MATH, and ARC-Challenge. Ablations and an analysis of training sequence probabilities show that the NLL loss term is crucial for performance. The method is a simple and efficient way to enhance the reasoning capabilities of large language models (LLMs).
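
The objective described above combines the standard DPO preference loss with an NLL term on the preferred (correct) sequence. Below is a minimal sketch of how such a combined loss could be computed from precomputed sequence log-probabilities; the function name, the alpha/beta hyperparameters, and the length normalization of the NLL term are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of a DPO + NLL objective in the spirit of Iterative RPO.
import torch
import torch.nn.functional as F

def iterative_rpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum log-prob of winning CoT+answer under current model
    policy_rejected_logps: torch.Tensor,  # sum log-prob of losing CoT+answer under current model
    ref_chosen_logps: torch.Tensor,       # same winning sequences under the frozen reference model
    ref_rejected_logps: torch.Tensor,     # same losing sequences under the frozen reference model
    chosen_lengths: torch.Tensor,         # token counts of the winning sequences
    beta: float = 0.1,                    # assumed DPO temperature
    alpha: float = 1.0,                   # assumed weight on the NLL term
) -> torch.Tensor:
    """DPO preference loss plus an NLL term on the preferred (correct) sequence."""
    # Standard DPO: negative log-sigmoid of the scaled log-ratio difference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Additional NLL term on the winning CoT + answer (length-normalized here
    # as an assumption), which the paper reports as crucial for performance.
    nll_loss = -policy_chosen_logps / chosen_lengths

    return (dpo_loss + alpha * nll_loss).mean()
```

In each iteration, preference pairs are built by pairing a generation whose final answer is correct (winner) with one whose final answer is incorrect (loser) for the same prompt, and the loss above is applied to those pairs before the next round of generation.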