From r to Q*: Your Language Model is Secretly a Q-Function

2024 | Rafael Rafailov*, Joey Hejna*, Ryan Park, Chelsea Finn
This paper examines the relationship between Direct Preference Optimization (DPO) and the Q-function in large language models (LLMs). The authors show that DPO can be derived as a general inverse Q-learning algorithm within the token-level Markov Decision Process (MDP) framework. They demonstrate theoretically that DPO implicitly learns a token-level reward function, with the language model's logits defining the optimal Q-function, which allows DPO to represent any dense reward function within the token MDP. The authors provide three empirical insights: (1) DPO can perform credit assignment at the token level; (2) likelihood-based search on a DPO policy is equivalent to search-based algorithms such as MCTS; and (3) the choice of reference policy affects the trajectory of the implicit rewards during training. They also show that a simple beam search yields meaningful improvements over the base DPO policy.
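In the token-level MDP, where the state s_t is the prompt plus the tokens generated so far and the action a_t is the next token, the identities behind these claims can be written compactly as below. This is a condensed rendering with notation lightly adapted from the paper: β is the KL coefficient, σ the logistic function, and (τ^w, τ^l) a preferred/dispreferred trajectory pair.

```latex
% Per-token implicit reward of DPO as an optimal advantage
\beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  \;=\; Q^*(s_t, a_t) - V^*(s_t) \;=\; A^*(s_t, a_t)

% Sequence-level DPO loss, rewritten as sums of per-token log-ratios
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(\tau^w, \tau^l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(
      \sum_{t} \beta \log \frac{\pi_\theta(a_t^{w} \mid s_t^{w})}{\pi_{\mathrm{ref}}(a_t^{w} \mid s_t^{w})}
    - \sum_{t} \beta \log \frac{\pi_\theta(a_t^{l} \mid s_t^{l})}{\pi_{\mathrm{ref}}(a_t^{l} \mid s_t^{l})}
    \right)\right]
```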
The paper discusses the implications of these findings for applications such as information elicitation in multi-turn dialogue, reasoning, agentic tasks, and end-to-end training of multi-model systems. The authors show that DPO can be interpreted as optimizing a per-token reward function restricted to the family of optimal advantage functions, and that it can learn the optimal policy for any per-token reward function, provided preference queries start at the same state and end at a terminal state. Because DPO learns the optimal advantage function for some reward, the training data determines how closely the learned advantage corresponds to that of the true reward, i.e., how well credit is assigned. This framing also connects guided decoding and search-based algorithms to likelihood-based search on the DPO policy, and it suggests that DPO can be applied to more sequential optimization settings such as multi-turn interaction or multi-modal generation. Finally, the authors explain why the likelihood of chosen responses should decrease during DPO training: the implicit rewards must decrease on average, since the KL-divergence between the reference policy and the optimal policy is necessarily positive at the end of training, and the choice of reference policy shapes the trajectory of these implicit rewards.
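To illustrate the credit-assignment and search claims above, the sketch below (not the authors' code; the tensor shapes, the `beta` value, and the masking convention are assumptions) computes DPO's implicit per-token rewards from policy and reference logits and sums them to score candidate continuations, which is the quantity a likelihood-based beam search would rank.

```python
# Minimal sketch: read off DPO's implicit per-token rewards
# beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t))
# from next-token logits in the usual (batch, seq_len, vocab) layout,
# and sum them over a response to score candidate continuations.
import torch
import torch.nn.functional as F


def per_token_implicit_rewards(
    policy_logits: torch.Tensor,  # (batch, seq_len, vocab) from the DPO policy
    ref_logits: torch.Tensor,     # (batch, seq_len, vocab) from the reference model
    input_ids: torch.Tensor,      # (batch, seq_len) token ids actually generated
    beta: float = 0.1,            # illustrative KL coefficient
) -> torch.Tensor:
    """Return beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)) per token."""
    # Logits at position t predict the token at position t+1, so shift by one.
    policy_logp = F.log_softmax(policy_logits[:, :-1], dim=-1)
    ref_logp = F.log_softmax(ref_logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    policy_token_logp = policy_logp.gather(-1, targets).squeeze(-1)
    ref_token_logp = ref_logp.gather(-1, targets).squeeze(-1)
    return beta * (policy_token_logp - ref_token_logp)  # (batch, seq_len - 1)


def sequence_score(rewards: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Sum per-token implicit rewards over response tokens (mask = 1) to rank beams."""
    return (rewards * response_mask).sum(dim=-1)
```

Ranking candidate continuations by `sequence_score` is one way to realize the likelihood-based beam search the paper evaluates; the full beam-search or MCTS loop around it is omitted for brevity.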