DPO Meets PPO: Reinforced Token Optimization for RLHF


April 2024; Revised: May 2025 | Han Zhong*, Zikang Shan†, Guhao Feng†, Wei Xiong‡, Xinle Cheng†, Li Zhao§, Di He†, Jiang Bian§, Liwei Wang†
The paper introduces a novel framework for Reinforcement Learning from Human Feedback (RLHF) that models the problem as a Markov Decision Process (MDP), enabling the capture of fine-grained token-wise information. This approach contrasts with the traditional sentence-level bandit formulation, which is known to be suboptimal for large language models (LLMs). The key contribution is the Reinforced Token Optimization (RTO) algorithm, which learns token-wise reward functions from preference data and uses them to optimize policies with Proximal Policy Optimization (PPO). Theoretical analysis shows that RTO can find near-optimal policies sample-efficiently. Practical experiments demonstrate that RTO outperforms PPO and other direct preference learning algorithms, achieving significant improvements on benchmarks like AlpacaEval 2 and Arena-Hard. RTO's performance is attributed to its ability to provide dense token-wise rewards, which enhance the effectiveness of PPO. Additionally, RTO exhibits superior data scaling properties, requiring less data to achieve comparable performance to PPO and continuing to improve with more data.
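To make the "dense token-wise rewards" idea concrete, below is a minimal sketch of how a per-token reward can be derived from a DPO-trained policy and a frozen reference policy and then fed to PPO in place of a single sentence-level score. The function and variable names (token_wise_rewards, logp_dpo, logp_ref, beta) are illustrative assumptions, not the authors' implementation, and the exact reward shaping in RTO may differ.

```python
# Hypothetical sketch: a DPO-implied token-wise reward of the form
#   r_t = beta * (log pi_dpo(a_t | s_t) - log pi_ref(a_t | s_t)),
# used as a dense per-token signal for PPO instead of one terminal reward.
# All names here are assumptions for illustration.
import torch


def token_wise_rewards(logp_dpo: torch.Tensor,
                       logp_ref: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    """Compute per-token rewards from log-probs of the sampled tokens.

    Args:
        logp_dpo: [batch, seq_len] log-probabilities of the taken tokens
            under a DPO-trained policy.
        logp_ref: [batch, seq_len] log-probabilities of the same tokens
            under the frozen reference policy.
        beta: scaling coefficient (assumed hyperparameter).
    Returns:
        [batch, seq_len] tensor of dense token-level rewards.
    """
    return beta * (logp_dpo - logp_ref)


if __name__ == "__main__":
    # Toy example: one response of five tokens with made-up log-probabilities.
    logp_dpo = torch.tensor([[-1.2, -0.8, -2.0, -0.5, -1.0]])
    logp_ref = torch.tensor([[-1.5, -1.0, -1.8, -0.9, -1.1]])
    rewards = token_wise_rewards(logp_dpo, logp_ref, beta=0.1)
    # Each token gets its own reward, in contrast with the bandit formulation
    # where a single scalar is assigned only at the end of the response.
    print(rewards)
```

In a full pipeline these per-token rewards would be combined with PPO's usual KL regularization toward the reference policy; the sketch only shows where the token-level signal comes from.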