REBEL: Reinforcement Learning via Regressing Relative Rewards

Dec 10, 2024 | Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
REBEL is a reinforcement learning (RL) algorithm that simplifies policy optimization by regressing the relative reward between two completions of a prompt. Unlike traditional methods such as PPO, which rely on complex components like value networks and clipping, REBEL reduces the problem to a sequence of regression tasks, enabling a lightweight implementation with strong theoretical guarantees. The algorithm is theoretically equivalent to Natural Policy Gradient (NPG) and applies to both language modeling and image generation. REBEL also incorporates offline data and handles intransitive preferences effectively. Empirically, REBEL outperforms PPO and DPO in language generation and achieves strong results on benchmarks such as AlpacaEval 2.0, MT-Bench, and the Open LLM Leaderboard, while being more computationally efficient and easier to implement than PPO.

REBEL's key contribution is reducing RL to regression, making it scalable and effective for modern generative models. The algorithm is implemented in Python and available on GitHub and Hugging Face.
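To make the "RL as regression" idea concrete, the core update can be sketched as a least-squares step: the difference in policy log-probability ratios between two completions of the same prompt (scaled by 1/η) is regressed onto their reward difference. The sketch below is a hypothetical tensor-level illustration under these assumptions, not the authors' reference implementation; the function name `rebel_loss` and its arguments are invented for exposition.

```python
import torch

def rebel_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Pairwise least-squares REBEL-style objective (illustrative sketch).

    Each argument is a tensor over a batch of prompt pairs:
    log-probabilities of completions a and b under the new and old
    policies, plus their scalar rewards.
    """
    # Predicted relative reward: difference of policy log-ratios, scaled by 1/eta
    pred = (1.0 / eta) * ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b))
    # Regression target: the observed reward gap between the two completions
    target = reward_a - reward_b
    # Squared-error regression loss, averaged over the batch
    return ((pred - target) ** 2).mean()
```

Minimizing this loss over batches of sampled completion pairs replaces the value network, advantage estimation, and clipping machinery that PPO requires.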