Weighted Preference Optimization (WPO) is a novel method for enhancing Reinforcement Learning from Human Feedback (RLHF) that addresses the distributional gap between the policy used to collect preference data and the target policy being optimized. WPO simulates on-policy learning with off-policy preference data by reweighting preference pairs according to their probability under the current policy, which mitigates the distributional gap and improves optimization without additional cost. On Alpaca Eval 2, WPO outperforms Direct Preference Optimization (DPO) by up to 5.6% and reaches a 48.6% length-controlled win rate against GPT-4-turbo, making it the strongest 8B model on the leaderboard. The method is validated on instruction-following benchmarks, yields consistent gains across different preference-optimization loss functions, and also performs better in hybrid settings that combine on-policy and off-policy data.
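To make the objective concrete, here is a minimal PyTorch sketch of a weighted DPO-style loss, assuming the per-pair weight is the length-normalized likelihood of the chosen and rejected responses under the current policy, detached from the gradient; the function name and the exact form of the weight are illustrative, not a reference implementation of the paper.

```python
import torch
import torch.nn.functional as F


def wpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             chosen_lens, rejected_lens, beta=0.1):
    """Weighted DPO-style loss (sketch).

    *_logps are summed token log-probabilities of each response, shape (batch,);
    *_lens are response lengths in tokens. The weight below is an assumption:
    the product of length-normalized sequence likelihoods of the chosen and
    rejected responses under the current policy, detached so it only rescales
    each pair's contribution to the gradient.
    """
    # Standard DPO logits: difference of policy/reference log-ratios.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)

    # Pair weight: how likely the (off-policy) pair is under the current policy.
    # exp(mean token log-prob) is the geometric mean of per-token probabilities.
    with torch.no_grad():
        w_chosen = torch.exp(policy_chosen_logps / chosen_lens)
        w_rejected = torch.exp(policy_rejected_logps / rejected_lens)
        weight = w_chosen * w_rejected

    # Reweighted negative log-sigmoid loss; pairs that look on-policy count more.
    return (weight * -F.logsigmoid(logits)).mean()
```

Because the weight is detached, it only rescales each pair's contribution, so pairs that already look likely under the current policy dominate the update while the direction of the DPO gradient for each pair is unchanged.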
WPO is implemented as a weighted preference-optimization objective in which each preference pair is reweighted by its probability under the current policy. With this objective, WPO consistently outperforms DPO and its variants, achieving new state-of-the-art results on Alpaca Eval 2, and the improvements hold across different preference-optimization loss functions. The reweighting also narrows, though does not fully close, the performance gap between off-policy and on-policy preference optimization. The findings further suggest that using on-policy dispreferred data is important for preference optimization, while using on-policy preferred data may be beneficial but is not as critical.
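For completeness, the summed log-probabilities and response lengths used in the sketch above can be computed from model logits and labels as follows; the masking convention (prompt and padding positions marked with -100, as in common Hugging Face-style training code) is an assumption for illustration, not a detail taken from the paper.

```python
import torch


def sequence_logps(logits, labels, ignore_index=-100):
    """Summed log-probability and token count of each response in a batch.

    logits: (batch, seq_len, vocab) from the policy or reference model.
    labels: (batch, seq_len) target token ids, with prompt/padding positions
            set to `ignore_index` so only response tokens are scored
            (an assumed convention, not specified by the paper).
    """
    # Shift so position t predicts token t+1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:].clone()

    mask = labels != ignore_index
    labels[~mask] = 0  # placeholder id for gather; masked out below

    log_probs = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, labels.unsqueeze(-1)).squeeze(-1)

    # Sum over response tokens only; also return lengths for normalization.
    return (token_logps * mask).sum(-1), mask.sum(-1)
```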