ORPO: Monolithic Preference Optimization without Reference Model

14 Mar 2024 | Jiwoo Hong, Noah Lee, James Thorne
This paper introduces ORPO (Odds Ratio Preference Optimization), a novel monolithic preference alignment method that eliminates the need for a reference model and a separate supervised fine-tuning (SFT) phase. ORPO incorporates an odds ratio-based penalty into the conventional negative log-likelihood (NLL) loss to differentiate between favored and disfavored responses. The authors demonstrate that a minor penalty for disfavored generation styles is sufficient for effective preference alignment during SFT. Empirical and theoretical analyses show that ORPO outperforms existing methods, including reinforcement learning with human feedback (RLHF) and direct preference optimization (DPO), in terms of instruction-following ability and performance on various benchmarks. Specifically, ORPO-trained models achieve superior results on AlpacaEval 2.0, IFEval, and MT-Bench compared to state-of-the-art models with more than 7B and 13B parameters. The paper also provides a detailed theoretical and computational analysis of ORPO, highlighting its efficiency and effectiveness.
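To make the odds ratio-based penalty concrete, the following is a minimal sketch of the ORPO objective as described in the paper: the standard NLL loss on the chosen response plus a term that penalizes the model when the odds of the disfavored response approach those of the favored one. The function name, argument layout, and the weighting value `lam` are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F


def orpo_loss(chosen_logps, rejected_logps, nll_loss, lam=0.1):
    """Sketch of the ORPO objective (assumed layout, not the official code).

    chosen_logps / rejected_logps: length-averaged log-probabilities
        log P(y|x) of the favored and disfavored responses under the policy.
    nll_loss: the usual causal-LM NLL loss on the chosen response (SFT term).
    lam: weight on the odds-ratio penalty (hyperparameter; 0.1 is illustrative).
    """
    # log odds(y|x) = log P(y|x) - log(1 - P(y|x))
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio penalty: -log sigmoid(log odds(y_w|x) - log odds(y_l|x)),
    # i.e. the negative log-sigmoid of the log odds ratio between the
    # favored (chosen) and disfavored (rejected) responses.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Monolithic objective: SFT loss plus the (small) preference penalty.
    return nll_loss + lam * or_loss.mean()
```

Because the penalty is added directly to the SFT loss and depends only on the current policy's probabilities, no frozen reference model or separate alignment stage is required, which is the source of the computational savings the paper reports.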