ORPO: Monolithic Preference Optimization without Reference Model

14 Mar 2024 | Jiwoo Hong, Noah Lee, James Thorne
This paper introduces ORPO, a reference-free, monolithic preference optimization algorithm for language models. ORPO eliminates the need for a separate preference alignment phase by incorporating an odds ratio-based penalty into the conventional negative log-likelihood (NLL) loss. This allows the model to learn desired generation styles, and to avoid undesirable ones, during supervised fine-tuning (SFT). The odds ratio is used to contrast favored and disfavored styles, making it a sensible choice for preference alignment across a range of model sizes.

ORPO is evaluated on multiple benchmarks, including AlpacaEval and IFEval, and outperforms state-of-the-art models with more than 7B parameters. Specifically, Mistral-ORPO-α and Mistral-ORPO-β achieve 11.33% and 12.20% on AlpacaEval 2.0, and 7.23 and 7.32 on MT-Bench. These results demonstrate the effectiveness of ORPO in preference alignment without requiring a reference model or additional training phases.

ORPO is also compared to other preference alignment methods such as RLHF and DPO. It achieves higher win rates against these methods, particularly for larger models, and it is computationally more efficient than RLHF and DPO because it does not require a reference model and uses fewer forward passes per batch.
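The core idea described above is to add an odds ratio-based penalty on top of the ordinary SFT loss. The snippet below is a minimal PyTorch sketch of such an objective, assuming length-normalized log-probabilities for the chosen and rejected responses have already been computed; the function name orpo_loss, its signature, and the weight lam are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def orpo_loss(chosen_logps, rejected_logps, nll_loss, lam=0.1):
        """Sketch of an ORPO-style objective.

        chosen_logps, rejected_logps: length-normalized log P(y_w|x) and
            log P(y_l|x) for the chosen and rejected responses, shape (batch,).
        nll_loss: standard SFT negative log-likelihood on the chosen responses.
        lam: weight on the odds-ratio penalty (illustrative hyperparameter).
        """
        eps = 1e-7
        # odds(y|x) = P(y|x) / (1 - P(y|x)); computed in log space for stability.
        log_odds_chosen = chosen_logps - torch.log1p(-chosen_logps.exp().clamp(max=1 - eps))
        log_odds_rejected = rejected_logps - torch.log1p(-rejected_logps.exp().clamp(max=1 - eps))

        # Odds-ratio penalty: -log sigmoid(log odds(y_w|x) - log odds(y_l|x)),
        # which pushes the chosen response's odds above the rejected one's.
        ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

        # Monolithic objective: SFT NLL plus the weighted odds-ratio penalty,
        # so alignment happens in the same pass as supervised fine-tuning.
        return nll_loss + lam * ratio_loss

Because the penalty only depends on the policy's own probabilities, no frozen reference model is needed, which is the source of the efficiency gains mentioned above.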
The paper also discusses the theoretical and computational aspects of ORPO, arguing that the odds ratio is a better-suited contrast for preference alignment than the probability ratio, which discriminates between favored and disfavored responses too sharply when used alongside SFT. ORPO is shown to preserve the domain adaptation role of SFT while penalizing unwanted generation styles. Overall, the results demonstrate that ORPO is a simple and effective method for preference alignment, matching or exceeding other methods in both efficiency and benchmark performance. The code and model checkpoints for Mistral-ORPO-α and Mistral-ORPO-β are released to aid reproducibility.
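To make the odds ratio versus probability ratio contrast concrete, the quantities involved can be written as follows. This is a compact restatement under standard definitions (with λ the penalty weight and σ the sigmoid), not a verbatim transcription of the paper's equations:

\[
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)},
\qquad
\mathbf{OR}_\theta(y_w, y_l) = \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)},
\]
\[
\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\Bigl[\, \mathcal{L}_{\mathrm{SFT}} - \lambda \,\log \sigma\bigl(\log \mathbf{OR}_\theta(y_w, y_l)\bigr) \Bigr],
\]

whereas a probability-ratio objective would instead contrast \(P_\theta(y_w \mid x) / P_\theta(y_l \mid x)\) directly.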