SimPO: Simple Preference Optimization with a Reference-Free Reward

8 Jul 2024 | Yu Meng, Mengzhou Xia, Danqi Chen
SimPO is a simple and effective offline preference optimization algorithm that outperforms existing methods across a range of benchmarks. Its implicit reward is the average log probability of a sequence under the policy model, which removes the need for a reference model and improves compute and memory efficiency. SimPO also introduces a target reward margin that encourages a larger gap between the rewards of winning and losing responses, and the length-normalized reward reduces length exploitation.

SimPO is compared to DPO and its variants across multiple training setups, including base and instruction-tuned models such as Mistral and Llama 3. It achieves substantial gains on benchmarks such as AlpacaEval 2 and Arena-Hard, with improvements of up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard. The top-performing model, built on Llama3-8B-Instruct, reaches a 53.7 length-controlled win rate on AlpacaEval 2 and a 36.5 win rate on Arena-Hard, surpassing comparable models.

SimPO is also more efficient than DPO, with roughly a 20% reduction in run time and a 10% reduction in GPU memory usage, and its reward is directly aligned with the generation likelihood, which improves generalization. SimPO outperforms DPO in reward accuracy as well as efficiency, and its performance is validated across diverse benchmarks and settings. The algorithm is effective at aligning large language models with human preferences and values and shows promise on safety and honesty aspects, though further work is needed on its theoretical understanding and on performance drops observed for certain tasks.
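To make the objective concrete, below is a minimal PyTorch sketch of the SimPO loss implied by the description above: the implicit reward is the length-normalized (average) log probability of a response scaled by a factor beta, and a Bradley-Terry style loss is applied with a target reward margin gamma, with no reference model involved. The function name, tensor shapes, and the default values of beta and gamma are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lengths: torch.Tensor,
               rejected_lengths: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 1.0) -> torch.Tensor:
    """Sketch of the SimPO objective.

    chosen_logps / rejected_logps: summed token log-probabilities of the
        winning / losing responses under the policy model, shape (batch,).
    chosen_lengths / rejected_lengths: token counts of each response, shape (batch,).
    beta: reward scaling; gamma: target reward margin (values here are illustrative).
    """
    # Length-normalized implicit rewards: beta times the average log probability.
    chosen_rewards = beta * chosen_logps / chosen_lengths
    rejected_rewards = beta * rejected_logps / rejected_lengths

    # Bradley-Terry loss with a target margin; no reference model is needed.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```

In practice the summed log probabilities and lengths would be computed from the policy model's logits over each preference pair in a batch, and beta and gamma would be tuned per training setup.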