15 Aug 2024 | Junru Lu, Jiazhen Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun
The paper addresses the issue of "verbosity" in Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. DPO, while offering a simpler alternative to Reinforcement Learning from Human Feedback (RLHF), suffers from over-optimization, where models generate longer responses that do not necessarily improve quality. The authors argue that this issue stems from an inherent algorithmic length reliance in DPO: the discrepancy between the sequence-level Kullback-Leibler (KL) divergences of the chosen and rejected sequences can lead to biased rewards that favor longer outputs. They introduce SamPO, a downsampling approach that regularizes the KL divergence by down-sampling an equal number of token-level features from each response, effectively mitigating the length reliance. Empirical evaluations across three LLMs and diverse datasets show that SamPO significantly reduces verbosity and improves overall performance by providing debiased rewards, achieving improvements of 5% to 12% over DPO and demonstrating its effectiveness in producing more balanced, high-quality responses.
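To make the down-sampling idea concrete, below is a minimal PyTorch sketch of a SamPO-style loss, under the assumption that per-token log-probabilities from the policy and reference models (and padding masks) are already available; the function name `sampo_style_loss` and its signature are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sampo_style_loss(
    policy_chosen_logps,    # (B, T_c) per-token log-probs under the policy
    policy_rejected_logps,  # (B, T_r)
    ref_chosen_logps,       # (B, T_c) per-token log-probs under the reference model
    ref_rejected_logps,     # (B, T_r)
    chosen_mask,            # (B, T_c) 1 for response tokens, 0 for padding
    rejected_mask,          # (B, T_r)
    beta=0.1,
):
    """Down-sample an equal number of token-level log-ratios from the chosen
    and rejected responses before forming the DPO-style reward, so the
    implicit reward no longer scales with raw response length."""
    losses = []
    for b in range(policy_chosen_logps.size(0)):
        c_idx = chosen_mask[b].bool()
        r_idx = rejected_mask[b].bool()

        # Token-level log-ratios (policy vs. reference) over valid tokens only.
        chosen_ratios = policy_chosen_logps[b][c_idx] - ref_chosen_logps[b][c_idx]
        rejected_ratios = policy_rejected_logps[b][r_idx] - ref_rejected_logps[b][r_idx]

        # Randomly down-sample both sides to the shorter response's length.
        k = min(chosen_ratios.numel(), rejected_ratios.numel())
        c_sample = chosen_ratios[torch.randperm(chosen_ratios.numel())[:k]]
        r_sample = rejected_ratios[torch.randperm(rejected_ratios.numel())[:k]]

        # Standard DPO logistic loss on the length-debiased reward margin.
        margin = beta * (c_sample.sum() - r_sample.sum())
        losses.append(-F.logsigmoid(margin))
    return torch.stack(losses).mean()
```

Because both responses now contribute the same number of token-level log-ratios to the reward margin, the implicit reward no longer grows simply because one response is longer than the other, which is the length bias the paper sets out to remove.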