Weighted Preference Optimization (WPO) is a novel method for enhancing Reinforcement Learning from Human Feedback (RLHF) that addresses the distributional gap between the policy used to collect preference data and the target policy being optimized. WPO simulates on-policy learning with off-policy preference data by reweighting preference pairs according to their probability under the current policy, which mitigates the distributional gap and improves optimization without additional cost. On Alpaca Eval 2, WPO outperforms Direct Preference Optimization (DPO) by up to 5.6% and reaches a 48.6% length-controlled win rate against GPT-4-turbo, making it the strongest 8B model on the leaderboard. The method is validated on instruction-following benchmarks, yields consistent gains across different preference-optimization loss functions, and also performs better in hybrid settings that combine on-policy and off-policy data.
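To make the objective concrete, here is a minimal PyTorch sketch of a weighted DPO-style loss, assuming the per-pair weight is the length-normalized likelihood of the chosen and rejected responses under the current policy, detached from the gradient; the function name and the exact form of the weight are illustrative, not a reference implementation of the paper.

```python
import torch
import torch.nn.functional as F


def wpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             chosen_lens, rejected_lens, beta=0.1):
    """Weighted DPO-style loss (sketch).

    *_logps are summed token log-probabilities of each response, shape (batch,);
    *_lens are response lengths in tokens. The weight below is an assumption:
    the product of length-normalized sequence likelihoods of the chosen and
    rejected responses under the current policy, detached so it only rescales
    each pair's contribution to the gradient.
    """
    # Standard DPO logits: difference of policy/reference log-ratios.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)

    # Pair weight: how likely the (off-policy) pair is under the current policy.
    # exp(mean token log-prob) is the geometric mean of per-token probabilities.
    with torch.no_grad():
        w_chosen = torch.exp(policy_chosen_logps / chosen_lens)
        w_rejected = torch.exp(policy_rejected_logps / rejected_lens)
        weight = w_chosen * w_rejected

    # Reweighted negative log-sigmoid loss; pairs that look on-policy count more.
    return (weight * -F.logsigmoid(logits)).mean()
```

Because the weight is detached, it only rescales each pair's contribution, so pairs that already look likely under the current policy dominate the update while the direction of the DPO gradient for each pair is unchanged.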
WPO is implemented as a weighted preference-optimization objective in which each preference pair is reweighted by its probability under the current policy. With this objective, WPO consistently outperforms DPO and its variants, achieving new state-of-the-art results on Alpaca Eval 2, and the improvements hold across different preference-optimization loss functions. The reweighting also narrows, though does not fully close, the performance gap between off-policy and on-policy preference optimization. The findings further suggest that using on-policy dispreferred data is important for preference optimization, while using on-policy preferred data may be beneficial but is not as critical.
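For completeness, the summed log-probabilities and response lengths used in the sketch above can be computed from model logits and labels as follows; the masking convention (prompt and padding positions marked with -100, as in common Hugging Face-style training code) is an assumption for illustration, not a detail taken from the paper.

```python
import torch


def sequence_logps(logits, labels, ignore_index=-100):
    """Summed log-probability and token count of each response in a batch.

    logits: (batch, seq_len, vocab) from the policy or reference model.
    labels: (batch, seq_len) target token ids, with prompt/padding positions
            set to `ignore_index` so only response tokens are scored
            (an assumed convention, not specified by the paper).
    """
    # Shift so position t predicts token t+1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:].clone()

    mask = labels != ignore_index
    labels[~mask] = 0  # placeholder id for gather; masked out below

    log_probs = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, labels.unsqueeze(-1)).squeeze(-1)

    # Sum over response tokens only; also return lengths for normalization.
    return (token_logps * mask).sum(-1), mask.sum(-1)
```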