From r to Q*: Your Language Model is Secretly a Q-Function

2024 | Rafael Rafailov*, Joey Hejna*, Ryan Park, Chelsea Finn
This paper examines the relationship between Direct Preference Optimization (DPO) and the Q-function in large language models (LLMs). The authors show that DPO can be derived as a general inverse Q-learning algorithm within the token-level Markov Decision Process (MDP) framework. They demonstrate theoretically that DPO implicitly learns a token-level reward function, with the language model's logits defining the optimal Q-function, which allows DPO to represent any dense reward function within the token MDP. The authors provide three empirical insights: (1) DPO can perform credit assignment at the token level; (2) likelihood-based search on a DPO policy is equivalent to search-based algorithms such as MCTS; and (3) the choice of reference policy affects the trajectory of the implicit rewards during training. They also show that a simple beam search yields meaningful improvements over the base DPO policy.
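In the token-level MDP, where the state s_t is the prompt plus the tokens generated so far and the action a_t is the next token, the identities behind these claims can be written compactly as below. This is a condensed rendering with notation lightly adapted from the paper: β is the KL coefficient, σ the logistic function, and (τ^w, τ^l) a preferred/dispreferred trajectory pair.

```latex
% Per-token implicit reward of DPO as an optimal advantage
\beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  \;=\; Q^*(s_t, a_t) - V^*(s_t) \;=\; A^*(s_t, a_t)

% Sequence-level DPO loss, rewritten as sums of per-token log-ratios
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(\tau^w, \tau^l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(
      \sum_{t} \beta \log \frac{\pi_\theta(a_t^{w} \mid s_t^{w})}{\pi_{\mathrm{ref}}(a_t^{w} \mid s_t^{w})}
    - \sum_{t} \beta \log \frac{\pi_\theta(a_t^{l} \mid s_t^{l})}{\pi_{\mathrm{ref}}(a_t^{l} \mid s_t^{l})}
    \right)\right]
```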
The paper discusses the implications of these findings for applications such as information elicitation in multi-turn dialogue, reasoning, agentic tasks, and end-to-end training of multi-model systems. The authors show that DPO can be interpreted as optimizing a per-token reward function restricted to the family of optimal advantage functions, and that it can learn the optimal policy for any per-token reward function, provided preference queries start at the same state and end at a terminal state. Because DPO learns the optimal advantage function for some reward, the training data determines how closely the learned advantage corresponds to that of the true reward, i.e., how well credit is assigned. This framing also connects guided decoding and search-based algorithms to likelihood-based search on the DPO policy, and it suggests that DPO can be applied to more sequential optimization settings such as multi-turn interaction or multi-modal generation. Finally, the authors explain why the likelihood of chosen responses should decrease during DPO training: the implicit rewards must decrease on average, since the KL-divergence between the reference policy and the optimal policy is necessarily positive at the end of training, and the choice of reference policy shapes the trajectory of these implicit rewards.
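To illustrate the credit-assignment and search claims above, the sketch below (not the authors' code; the tensor shapes, the `beta` value, and the masking convention are assumptions) computes DPO's implicit per-token rewards from policy and reference logits and sums them to score candidate continuations, which is the quantity a likelihood-based beam search would rank.

```python
# Minimal sketch: read off DPO's implicit per-token rewards
# beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t))
# from next-token logits in the usual (batch, seq_len, vocab) layout,
# and sum them over a response to score candidate continuations.
import torch
import torch.nn.functional as F


def per_token_implicit_rewards(
    policy_logits: torch.Tensor,  # (batch, seq_len, vocab) from the DPO policy
    ref_logits: torch.Tensor,     # (batch, seq_len, vocab) from the reference model
    input_ids: torch.Tensor,      # (batch, seq_len) token ids actually generated
    beta: float = 0.1,            # illustrative KL coefficient
) -> torch.Tensor:
    """Return beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)) per token."""
    # Logits at position t predict the token at position t+1, so shift by one.
    policy_logp = F.log_softmax(policy_logits[:, :-1], dim=-1)
    ref_logp = F.log_softmax(ref_logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    policy_token_logp = policy_logp.gather(-1, targets).squeeze(-1)
    ref_token_logp = ref_logp.gather(-1, targets).squeeze(-1)
    return beta * (policy_token_logp - ref_token_logp)  # (batch, seq_len - 1)


def sequence_score(rewards: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Sum per-token implicit rewards over response tokens (mask = 1) to rank beams."""
    return (rewards * response_mask).sum(dim=-1)
```

Ranking candidate continuations by `sequence_score` is one way to realize the likelihood-based beam search the paper evaluates; the full beam-search or MCTS loop around it is omitted for brevity.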