DPO Meets PPO: Reinforced Token Optimization for RLHF

April 2024 (revised May 2025) | Han Zhong*, Zikang Shan†, Guhao Feng†, Wei Xiong‡, Xinle Cheng†, Li Zhao§, Di He†, Jiang Bian§, Liwei Wang†
This paper introduces Reinforced Token Optimization (RTO), a novel framework for Reinforcement Learning from Human Feedback (RLHF) that addresses the limitations of Proximal Policy Optimization (PPO) in handling token-level rewards. RTO models RLHF as a Markov Decision Process (MDP), which captures fine-grained token-wise information. The framework learns a token-wise reward function from preference data and performs policy optimization based on this signal. Theoretically, RTO is proven to find near-optimal policies efficiently. Practically, RTO integrates Direct Preference Optimization (DPO) and PPO, leveraging DPO's token-wise characterization of response quality to enhance PPO training. Extensive experiments show that RTO outperforms PPO and other direct preference learning algorithms, achieving significant improvements on benchmarks such as AlpacaEval 2 and Arena-Hard. RTO also demonstrates strong data scaling properties, reaching PPO-level performance with less data and continuing to improve with more data. The paper also discusses related work, theoretical studies, and practical implementations of RTO, highlighting its advantages over traditional bandit formulations in RLHF.
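To make the token-wise signal concrete, below is a minimal PyTorch sketch (not the authors' released code) of the DPO-implied per-token reward that RTO feeds into PPO. The function name, tensor shapes, and the value of beta are illustrative assumptions, and the KL-regularization term that PPO adds on top is omitted for brevity.

```python
import torch

def rto_token_rewards(dpo_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Dense per-token reward implied by a DPO-trained policy.

    r_t = beta * (log pi_dpo(a_t | s_t) - log pi_ref(a_t | s_t))

    Replaces the single sequence-level reward of the bandit formulation
    with a token-wise signal for PPO's advantage estimation.

    Args:
        dpo_logprobs: (batch, seq_len) log-probs of the sampled tokens
            under the DPO-trained policy.
        ref_logprobs: (batch, seq_len) log-probs of the same tokens
            under the frozen reference policy.
        beta: scaling coefficient (value here is an assumption).
    """
    return beta * (dpo_logprobs - ref_logprobs)

# Example: batch of 2 responses, 4 tokens each (random log-probs for illustration).
dpo_lp = torch.randn(2, 4)
ref_lp = torch.randn(2, 4)
rewards = rto_token_rewards(dpo_lp, ref_lp)  # shape (2, 4)
```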