5 Jun 2024 | Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, Minlie Huang
This article introduces *efficient exact optimization (EXO)* for aligning language models with human preferences. The alignment problem is formulated as optimizing a policy to maximize the expected reward while minimizing the reverse KL divergence from an initial (reference) policy. Reinforcement learning (RL) is the standard approach to this objective, but it suffers from high variance and optimization inefficiency. Direct preference optimization (DPO) was proposed as a simpler alternative, yet it is shown to produce only a compromised approximation of the optimal policy.

EXO instead performs probability matching between the parameterized policy and the optimal policy by minimizing the reverse KL divergence between them, and it provably optimizes in the same direction as RL algorithms asymptotically while avoiding the complexities of RL training. In contrast, DPO is shown to correspond to minimizing the forward KL divergence, which is mean-seeking and therefore less effective at capturing the essential characteristics of the optimal policy.

Theoretical and empirical analyses demonstrate that EXO outperforms both DPO and PPO on realistic human preference data, and experiments on summarization, dialogue generation, and instruction following confirm its effectiveness in aligning language models with human preferences.
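As a concrete reference for the formulation summarized above, the following is a sketch of the standard KL-regularized alignment objective and its well-known closed-form optimum; the notation ($\pi_\theta$, $\pi_{\mathrm{ref}}$, $r$, $\beta$) follows common convention and is not quoted from the paper verbatim:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big),
\qquad
\pi^*(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big).
$$

The contrast between EXO and DPO can then be read as a choice of divergence measured against this optimal policy $\pi^*$ (again a sketch in standard notation, not the paper's exact statement):

$$
\underbrace{\mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi^* \big)}_{\text{reverse KL (EXO): mode-seeking}}
\qquad \text{vs.} \qquad
\underbrace{\mathbb{D}_{\mathrm{KL}}\big( \pi^* \,\|\, \pi_\theta \big)}_{\text{forward KL (DPO): mean-seeking}}.
$$

Intuitively, the reverse KL concentrates the learned policy on high-reward modes of $\pi^*$, whereas the forward KL spreads probability mass to cover all of $\pi^*$, which is the sense in which DPO's approximation is described as compromised.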