15 Aug 2024 | Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun
This paper addresses the issue of "verbosity" in Direct Preference Optimization (DPO), a method for aligning large language models (LLMs) with human preferences. The authors argue that DPO's verbosity stems not only from biased labels in the training data but also from an inherent algorithmic reliance on response length: because chosen and rejected responses usually differ in token length, their sequence-level Kullback-Leibler (KL) divergence terms differ as well, which inflates or deflates the implicit reward margin. To mitigate this, the authors propose SamPO, a down-sampled variant of DPO that reduces length reliance by down-sampling token-level probability features to a common length, yielding a length-regularized KL divergence term.
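To make the mechanism concrete, below is a minimal PyTorch sketch of the down-sampling idea, assuming per-token log-probabilities under the policy and the reference model are already available. The function names (`sampo_logratio`, `sampo_loss`), the sampling details, and the hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sampo_logratio(policy_logps, ref_logps, num_samples, generator=None):
    """Sum of per-token log-ratios, down-sampled to a fixed token budget.

    policy_logps / ref_logps: 1-D tensors of per-token log-probabilities for
    one response under the policy and the reference model.
    num_samples: number of tokens to keep, typically min(|y_chosen|, |y_rejected|).
    """
    log_ratio = policy_logps - ref_logps               # per-token log pi / pi_ref
    if log_ratio.numel() > num_samples:
        # Sample a subset of token positions without replacement.
        idx = torch.randperm(log_ratio.numel(), generator=generator)[:num_samples]
        log_ratio = log_ratio[idx]
    return log_ratio.sum()

def sampo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1, generator=None):
    """DPO-style loss with down-sampled KL terms (a sketch of SamPO's idea).

    pol_w / ref_w: per-token log-probs of the chosen response (policy / reference).
    pol_l / ref_l: per-token log-probs of the rejected response.
    """
    k = min(pol_w.numel(), pol_l.numel())               # shared token budget
    r_w = sampo_logratio(pol_w, ref_w, k, generator)
    r_l = sampo_logratio(pol_l, ref_l, k, generator)
    # Standard Bradley-Terry style DPO objective on the length-regularized margin.
    return -F.logsigmoid(beta * (r_w - r_l))
```

Because both responses are reduced to the same token budget before summing, the implicit reward margin no longer grows or shrinks simply because one response is longer than the other.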
The paper evaluates SamPO across three LLMs of varying scales and a diverse set of benchmarks, demonstrating that it significantly reduces verbosity compared to DPO while improving performance by 5% to 12%. The authors also show that SamPO produces more balanced rewards, leading to shorter yet more effective responses. The method is validated through extensive experiments, including comparisons with other baselines such as Hybrid DPO+SFT, TDPO, and SimPO, which SamPO consistently outperforms in terms of both accuracy and response length.
The study highlights the importance of addressing length bias in preference optimization and provides a practical solution through the SamPO method. The results demonstrate that by down-sampling token-level features, the algorithm can effectively reduce the influence of response length on reward calculation, leading to more accurate and efficient alignment with human preferences. The paper concludes that SamPO is a promising approach for improving the performance and efficiency of DPO in aligning LLMs with human preferences.
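As a quick illustration of the length effect described above, the snippet below reuses the `sampo_loss` sketch with fabricated per-token log-probabilities for a short chosen response and a much longer rejected response; all values are made up and serve only to contrast the full-length margin with the down-sampled one.

```python
import torch

# Fabricated per-token log-probabilities: a 20-token chosen response and an
# 80-token rejected response (random values, purely illustrative).
torch.manual_seed(0)
pol_w, ref_w = torch.randn(20) - 1.0, torch.randn(20) - 1.0
pol_l, ref_l = torch.randn(80) - 1.0, torch.randn(80) - 1.0

# Vanilla DPO margin: sums over all tokens, so the longer rejected response
# contributes a larger (and noisier) sequence-level term.
full_margin = (pol_w - ref_w).sum() - (pol_l - ref_l).sum()

# Down-sampled margin via the sampo_loss sketch above: both responses are
# reduced to the same 20-token budget before summing.
loss = sampo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1)
print(full_margin.item(), loss.item())
```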