Dense Reward for Free in Reinforcement Learning from Human Feedback


1 Feb 2024 | Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
This paper introduces Attention Based Credit (ABC), a method that improves Reinforcement Learning from Human Feedback (RLHF) by using the attention map of the reward model to redistribute the reward across the generated text. The reward model is normally used only to assign a single scalar score to an entire completion, but it also carries additional information in the form of attention weights over tokens. ABC uses these weights to spread the scalar reward over individual tokens, producing a denser and more informative reward signal.

The motivation is the difficulty of optimizing with sparse rewards in RLHF, where the reward arrives only at the end of a completion. By leveraging the reward model's attention mechanism, ABC provides more granular, per-token feedback, so credit is distributed across the whole generation rather than concentrated on the final token, which makes training more efficient and more stable.

Theoretically, the redistribution is shown to be equivalent to potential-based reward shaping, which guarantees that the set of optimal policies is unchanged: a policy that is optimal under the ABC reward is also optimal for the original reward function. In practice, the dense reward is a convex combination of the attention-weighted reward and the original terminal reward, giving a tunable trade-off between the two. The method requires only minimal changes to the standard RLHF setup and no additional models or significant extra computation, since the attention weights are already produced during the reward model's forward pass.

Experiments on three tasks (positive generation, summarization, and single-turn dialogue) show that ABC trains faster and more stably than standard RLHF and reaches better local optima. It is also more robust to longer generations, which otherwise make the terminal reward even sparser, because the reward is spread across all tokens of the completion. Overall, ABC is a simple, broadly applicable way to improve RLHF training that can be added to existing setups with minimal changes.
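To make the mechanism concrete, the following is a minimal sketch (not the authors' implementation) of how attention-weighted token rewards could be computed. It assumes a HuggingFace-style reward model that returns its attention maps when called with output_attentions=True and reads its scalar score from the final token position; the function name abc_token_rewards, the mixing weight beta, and the choice of last-layer, head-averaged attention are illustrative assumptions, and special-token bookkeeping is ignored for brevity.

import torch

def abc_token_rewards(reward_model, tokenizer, prompt, completion, beta=0.5):
    # Score the full prompt + completion with the reward model, asking it
    # to also return its attention maps.
    inputs = tokenizer(prompt + completion, return_tensors="pt")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    with torch.no_grad():
        out = reward_model(**inputs, output_attentions=True)

    # Sparse signal: a single scalar reward for the whole completion.
    R = out.logits.squeeze().item()

    # Attention from the final (scoring) position to every token,
    # taken from the last layer and averaged over heads.
    last_layer = out.attentions[-1]             # shape (1, heads, seq, seq)
    attn = last_layer[0, :, -1, :].mean(dim=0)  # shape (seq,)

    # Keep only the weights over completion tokens and renormalise them.
    alpha = attn[prompt_len:]
    alpha = alpha / alpha.sum()

    # Convex combination: attention-weighted dense reward on every token,
    # plus the remaining share of the original reward on the final token.
    dense = beta * alpha * R
    dense[-1] = dense[-1] + (1.0 - beta) * R
    return dense                                # one reward per completion token

The returned per-token rewards would then replace the usual terminal-only reward inside a standard PPO-style RLHF loop; setting beta to 0 recovers ordinary sparse RLHF.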
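For reference, the potential-based reward shaping result the paper appeals to is the classical one of Ng et al. (1999): for any potential function \Phi over states and discount factor \gamma, the shaped reward

\[
\tilde{r}(s_t, a_t, s_{t+1}) \;=\; r(s_t, a_t) + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)
\]

induces exactly the same set of optimal policies as the original reward r. The paper's claim is that ABC's redistributed reward differs from the original terminal reward only by a term of this shaping form, which is why optimizing the dense reward still yields a policy that is optimal for the original reward.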