**Abstract:**
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to improve simplicity and training stability. This work introduces SimPO, a simpler and more effective approach. The key design of SimPO is the use of the average log probability of a sequence as the implicit reward, which better aligns with model generation and eliminates the need for a reference model, making SimPO more compute- and memory-efficient. Additionally, a target reward margin is introduced into the Bradley-Terry objective to encourage a larger margin between winning and losing responses, further improving performance. SimPO is compared to DPO and its variants across various state-of-the-art training setups, including base and instruction-tuned models such as Mistral and Llama3. Extensive evaluation on benchmarks such as AlpacaEval 2, MT-Bench, and Arena-Hard demonstrates that SimPO consistently outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard.
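Written out, the implicit reward and objective described above take roughly the following form. The notation is reconstructed from this summary rather than quoted from the paper: $\pi_\theta$ is the policy being trained, $y_w$ and $y_l$ the winning and losing responses, $\beta$ a scaling constant, and $\gamma$ the target reward margin.

```latex
% Length-normalized implicit reward: the average log probability of response y
r_{\mathrm{SimPO}}(x, y) \;=\; \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
  \;=\; \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta\big(y_i \mid x,\, y_{<i}\big)

% Bradley-Terry objective shifted by the target reward margin \gamma
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) \;=\;
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
    \;-\; \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
    \;-\; \gamma \right) \right]
```

Note that, unlike DPO's implicit reward, neither term references a reference model $\pi_{\mathrm{ref}}$, and each log probability is divided by the response length.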
**Introduction:**
Learning from human feedback is crucial for aligning large language models with human values and intentions. While classical RLHF methods have shown impressive results, they present optimization challenges due to their multi-stage procedure. SimPO addresses these challenges by aligning the reward function in preference optimization with the generation metric, eliminating the need for a reference model. The core of SimPO includes a length-normalized reward and a target reward margin to ensure a significant reward difference between winning and losing responses. Extensive analysis shows that SimPO effectively utilizes preference data, leading to more accurate likelihood rankings and better policy models. The top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 53.7 length-controlled win rate on AlpacaEval 2 and a 36.5 win rate on Arena-Hard, making it the strongest 8B open-source model to date.
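As a concrete illustration of how the length-normalized reward and the target reward margin combine into a pairwise loss, here is a minimal PyTorch sketch. It assumes the summed token log-probabilities and response lengths have already been computed for each preference pair; the function name, signature, and the example values of `beta` and `gamma` are illustrative choices for this sketch, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lens: torch.Tensor,
               rejected_lens: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """Pairwise SimPO-style loss for a batch of preference pairs.

    chosen_logps / rejected_logps: summed token log-probabilities of the
        winning / losing responses under the current policy, shape [batch].
    chosen_lens / rejected_lens: response lengths in tokens, shape [batch].
    beta: reward scaling constant; gamma: target reward margin.
    """
    # Length-normalized implicit rewards (average log probability per token);
    # no reference model is involved.
    chosen_reward = beta * chosen_logps / chosen_lens
    rejected_reward = beta * rejected_logps / rejected_lens

    # Bradley-Terry objective shifted by the target reward margin gamma.
    margin = chosen_reward - rejected_reward - gamma
    return -F.logsigmoid(margin).mean()

# Example usage with dummy values for a batch of two preference pairs.
loss = simpo_loss(
    chosen_logps=torch.tensor([-55.0, -80.0]),
    rejected_logps=torch.tensor([-70.0, -95.0]),
    chosen_lens=torch.tensor([50.0, 64.0]),
    rejected_lens=torch.tensor([60.0, 72.0]),
)
print(loss.item())
```

The margin term `gamma` means the winning response must be preferred by a fixed amount of average log probability before the loss saturates, which encourages a clearer separation between winning and losing responses than the plain Bradley-Terry objective.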