28 Mar 2024 | Ryan Park*, Rafael Rafailov*, Stefano Ermon, Chelsea Finn
The paper addresses the issue of length exploitation in Direct Preference Optimization (DPO), a method used in Reinforcement Learning from Human Feedback (RLHF) to train large language models. RLHF has been crucial for improving the capabilities of LLMs, but it is known to exploit biases in human preferences, such as a preference for verbosity. Because DPO does not train a separate reward model or use reinforcement learning directly, these biases had not previously been studied in the DPO setting. The authors examine the length problem in DPO, showing significant exploitation and linking it to out-of-distribution bootstrapping. They develop a principled yet simple regularization strategy that prevents length exploitation while preserving improvements in model quality. The approach is evaluated on summarization and dialogue datasets, achieving up to 20% higher win rates when controlling for length, despite the GPT-4 judge's well-known verbosity bias. The paper also discusses the relationship between length and quality, the effectiveness of the regularization strategy, and the potential causes of length exploitation.
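As a rough illustration of the kind of length regularization described above, the sketch below adds a length-difference penalty to the standard DPO objective so that a response cannot gain preference margin simply by being longer. The function name, the `alpha` weight, and the exact form of the penalty are assumptions made for this sketch, not necessarily the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(
    policy_chosen_logps,    # log pi_theta(y_w | x), summed over response tokens
    policy_rejected_logps,  # log pi_theta(y_l | x)
    ref_chosen_logps,       # log pi_ref(y_w | x)
    ref_rejected_logps,     # log pi_ref(y_l | x)
    chosen_lengths,         # |y_w| in tokens (float tensor)
    rejected_lengths,       # |y_l| in tokens (float tensor)
    beta: float = 0.1,      # DPO inverse-temperature
    alpha: float = 0.01,    # length-penalty weight (illustrative value)
):
    """Standard DPO margin minus a penalty on the length difference
    between the chosen and rejected responses (illustrative sketch)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Penalize the margin by how much longer the chosen response is,
    # discouraging the policy from "winning" through verbosity alone.
    length_penalty = alpha * (chosen_lengths - rejected_lengths)
    logits = chosen_rewards - rejected_rewards - length_penalty
    return -F.logsigmoid(logits).mean()
```

With `alpha = 0` this reduces to the usual DPO loss; increasing `alpha` trades some raw preference margin for a weaker incentive toward longer responses.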