Direct Preference Optimization with an Offset


6 Jun 2024 | Afra Amini, Tim Vieira, Ryan Cotterell
This paper introduces Direct Preference Optimization with an Offset (ODPO), an extension of Direct Preference Optimization (DPO). DPO is a successful approach for aligning large language models with human preferences without a reward model or reinforcement learning, but it treats all preference pairs equally, regardless of how strongly one response is preferred over the other. ODPO addresses this limitation by incorporating an offset into the loss function that depends on the difference between the estimated rewards of the preferred and dispreferred responses. The offset requires the preferred response to be more likely than the dispreferred response by a margin determined by that reward difference. Experiments on sentiment control, toxicity control, and summarization show that ODPO outperforms DPO in aligning language models with human preferences, especially when the number of preference pairs is limited: ODPO reaches higher rewards at comparable KL divergence from the reference model, indicating its effectiveness in generating responses that align with human preferences.
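The core change relative to DPO is the offset term inside the logistic loss. Below is a minimal PyTorch sketch under stated assumptions: the function name, the `alpha` scaling, and the use of `log1p` as the monotone function applied to the reward gap are illustrative choices, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def odpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              chosen_rewards, rejected_rewards,
              beta=0.1, alpha=1.0):
    """Sketch of an ODPO-style objective.

    The DPO margin is the difference of implicit rewards
    beta * log(pi_theta / pi_ref) for the preferred vs. dispreferred
    response; the offset additionally requires that margin to exceed a
    value that grows with the gap between the two estimated rewards.
    """
    # Log-ratios of the policy to the reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Offset: assumed here to be alpha * log(1 + reward gap); any
    # monotonically increasing function of the gap could be substituted.
    reward_gap = (chosen_rewards - rejected_rewards).clamp(min=0.0)
    offset = alpha * torch.log1p(reward_gap)

    # Logistic loss on the offset-adjusted margin, averaged over the batch.
    margin = beta * (chosen_logratio - rejected_logratio) - offset
    return -F.logsigmoid(margin).mean()
```

Setting the offset to zero (for example, `alpha=0`) recovers the standard DPO loss, which is why DPO can be viewed as the special case in which every preference pair is weighted equally.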