Iterative Reasoning Preference Optimization

26 Jun 2024 | Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
This paper proposes an iterative training method called Iterative Reasoning Preference Optimization (Iterative RPO) to improve reasoning performance in large language models (LLMs). The method optimizes the preference between competing generated Chain-of-Thought (CoT) candidates based on the correctness of their final answers.

Training uses a modified DPO loss with an additional negative log-likelihood (NLL) term, which is crucial for performance; a minimal sketch of this combined objective appears below. In each iteration, the approach generates multiple responses for every training prompt, constructs preference pairs by contrasting responses whose final answers are correct with those whose answers are wrong, and trains a new model on these pairs. The process is repeated until reasoning performance saturates.

The method is evaluated on three reasoning tasks: GSM8K, ARC-Challenge, and MATH. On GSM8K, the model reaches 81.6% accuracy after iterative training, outperforming other Llama-2-based models; on ARC-Challenge it reaches 86.7% without using the provided ARC Corpus; and on MATH it reaches 20.8%, again outperforming the baselines. Reasoning performance improves significantly across successive iterations, with the NLL term playing a crucial role. Because the approach requires neither a human in the loop nor additional training data, it is simple and efficient to implement, and it is effective across a range of reasoning tasks.

Iterative RPO is related to other iterative training approaches such as Self-Rewarding LLMs and STaR, but differs in its use of preference optimization and in the inclusion of the NLL term.
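The combined objective adds an NLL term on the winning (correct-answer) sequence to the standard DPO preference loss. The snippet below is a minimal sketch rather than the authors' implementation: it assumes the summed response log-probabilities have already been computed under the current policy and the frozen reference model, and the length normalization of the NLL term as well as the beta and alpha values are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def dpo_nll_loss(
    policy_chosen_logps,    # (B,) summed log p_theta(chosen response | prompt)
    policy_rejected_logps,  # (B,) summed log p_theta(rejected response | prompt)
    ref_chosen_logps,       # (B,) same quantities under the frozen reference model
    ref_rejected_logps,     # (B,)
    chosen_lengths,         # (B,) token counts of the chosen responses
    beta: float = 0.1,      # DPO temperature (illustrative value)
    alpha: float = 1.0,     # weight on the NLL term (illustrative value)
):
    # Standard DPO term: margin between policy/reference log-ratios of the
    # chosen (correct) and rejected (incorrect) chain-of-thought candidates.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

    # Extra NLL term on the chosen sequence, length-normalized here (assumption).
    nll_term = -policy_chosen_logps / chosen_lengths

    return (dpo_term + alpha * nll_term).mean()
```

Intuitively, the NLL term keeps the likelihood of the winning chain-of-thought from shrinking during preference optimization, which the paper reports is important for the gains over plain DPO.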
The paper also discusses related work in more detail and highlights the effectiveness of the proposed method in improving reasoning performance; the outer generation-and-pairing loop it describes is sketched below.
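For concreteness, here is a hedged sketch of that outer loop under stated assumptions, not the paper's code: `sample_fn`, `extract_answer`, and `train_fn` are hypothetical stand-ins for an LLM sampler, an answer parser, and a DPO+NLL trainer, and the sampling counts are arbitrary.

```python
import random

def build_preference_pairs(model, prompts, gold_answers, sample_fn, extract_answer,
                           num_samples=8, pairs_per_prompt=4):
    """Sample CoT candidates per prompt and pair correct with incorrect ones."""
    pairs = []
    for prompt, gold in zip(prompts, gold_answers):
        candidates = [sample_fn(model, prompt) for _ in range(num_samples)]
        winners = [c for c in candidates if extract_answer(c) == gold]
        losers = [c for c in candidates if extract_answer(c) != gold]
        if not winners or not losers:
            continue  # need at least one correct and one incorrect CoT to form a pair
        for _ in range(pairs_per_prompt):
            pairs.append({
                "prompt": prompt,
                "chosen": random.choice(winners),   # correct final answer
                "rejected": random.choice(losers),  # incorrect final answer
            })
    return pairs

def iterative_rpo(model, prompts, gold_answers, sample_fn, extract_answer,
                  train_fn, num_iterations=3):
    """Repeat generate -> pair -> train for a fixed budget (or until saturation)."""
    for _ in range(num_iterations):
        pairs = build_preference_pairs(model, prompts, gold_answers,
                                       sample_fn, extract_answer)
        # Train the next model on the fresh pairs with the DPO + NLL objective,
        # using the current model as the frozen reference for the DPO term.
        model = train_fn(model, pairs, reference_model=model)
    return model
```

Each round's preference data is produced by the model from the previous round, so the training signal comes entirely from the prompts, the gold final answers, and the model's own sampled chains of thought, with no human annotation of the reasoning steps.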