1 Feb 2024 | Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
This paper introduces a method called **Attention Based Credit (ABC)** to improve the training of Large Language Models (LLMs) with Reinforcement Learning from Human Feedback (RLHF). RLHF has been crucial for enabling LLMs to follow instructions and provide useful assistance. However, the standard approach generates a full completion and then has a separate reward model assign a single scalar reward at the end of the episode, a sparse signal that is known to be difficult to optimize.
ABC leverages the attention weights that the reward model already computes when scoring a completion. These weights are used to redistribute the scalar reward across the tokens of the completion, effectively densifying the signal and highlighting the most important tokens, without additional computational cost or extra modeling.
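To make the redistribution concrete, here is a minimal PyTorch sketch of the general idea rather than the paper's exact formulation: it assumes a single vector of attention weights (in practice these would come from some aggregation over the reward model's heads and layers) and a hypothetical mixing coefficient `beta` that blends the dense, attention-weighted reward with the original end-of-sequence reward.

```python
import torch

def redistribute_reward(attn_weights: torch.Tensor,
                        scalar_reward: float,
                        beta: float = 0.5) -> torch.Tensor:
    """Spread a single sequence-level reward over completion tokens.

    attn_weights: attention mass the reward model places on each completion
        token (hypothetical aggregation over heads/layers).
    scalar_reward: the reward model's single scalar score for the sequence.
    beta: fraction of the reward redistributed by attention; the rest stays
        on the final token as in vanilla RLHF.
    """
    # Normalise so the redistributed portion carries exactly beta * reward.
    weights = attn_weights / attn_weights.sum()
    dense = beta * scalar_reward * weights

    # Leave the remaining share on the last token (the usual sparse signal).
    sparse = torch.zeros_like(weights)
    sparse[-1] = (1.0 - beta) * scalar_reward

    # Per-token rewards sum to the original scalar, so the total return is
    # preserved while credit is assigned throughout the completion.
    return dense + sparse

# Example: a 4-token completion where the reward model attends mostly to token 2.
per_token = redistribute_reward(torch.tensor([0.1, 0.2, 0.6, 0.1]), scalar_reward=1.0)
```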
The paper makes three key contributions:
1. **Introduction of ABC**: A simple extension to vanilla RLHF that uses attention weights to redistribute the scalar reward.
2. **Theoretical Equivalence**: Shows that ABC is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged (the standard shaping form is recalled after this list).
3. **Empirical Validation**: Demonstrates that ABC stabilizes training, accelerates learning, and may lead to better local optima in practical scenarios.
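For reference, the classical potential-based shaping result of Ng et al. (1999) states that augmenting a reward function with a term of the following form, for any potential function over states and the discount factor, leaves the optimal policies of the underlying MDP unchanged; the paper's guarantee rests on showing ABC's redistributed reward can be written in this form (the specific potential it constructs is not reproduced here).

$$
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s)
$$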
The paper also discusses the challenges of optimizing sparse rewards in RL and the benefits of densifying the reward signal. It provides a detailed explanation of the ABC method, including its construction and theoretical guarantees. Experimental results across three tasks—positive generation, summarization, and single-turn dialogue—show that ABC improves the optimization process, leading to faster convergence, more stable training, and potentially better local optima.