8 Dec 2024 | Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
This paper introduces TLCR, a novel reward model for fine-grained reinforcement learning from human feedback (RLHF). TLCR provides continuous-scale dense reward signals at the token level, addressing the limitations of previous sequence-level or token-level discrete reward methods. The key idea is to use a discriminator trained to distinguish positive and negative tokens, and to assign continuous rewards based on the confidence of the discriminator.
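To make the confidence-to-reward idea concrete, here is a minimal sketch assuming a binary token-level preference discriminator loaded as a Hugging Face token-classification model. The checkpoint path is hypothetical, and the simple 2p − 1 mapping from confidence to reward is an illustrative choice, not necessarily the paper's exact formulation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical checkpoint: any binary token classifier fine-tuned as a
# token-level preference discriminator could be substituted here.
DISCRIMINATOR = "path/to/token-preference-discriminator"

tokenizer = AutoTokenizer.from_pretrained(DISCRIMINATOR)
model = AutoModelForTokenClassification.from_pretrained(DISCRIMINATOR)  # 2 labels: negative / positive


@torch.no_grad()
def token_level_rewards(prompt: str, response: str) -> torch.Tensor:
    """Map the discriminator's per-token confidence to a continuous reward
    in [-1, 1]: near +1 for confidently positive tokens, near -1 for
    confidently negative ones, and ~0 where the discriminator is unsure.
    The whole sequence is scored here; in practice only the response
    positions would be kept as the dense reward signal."""
    inputs = tokenizer(prompt, response, return_tensors="pt")
    logits = model(**inputs).logits              # (1, seq_len, 2)
    p_pos = logits.softmax(dim=-1)[0, :, 1]      # P(token is "positive")
    return 2.0 * p_pos - 1.0                     # confidence-scaled dense reward
```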
The paper first discusses the challenges of using sequence-level rewards in RLHF, which create a mismatch between sequence-level preference labels and the token-level decisions the policy actually makes. It then presents TLCR, which uses a token-level preference discriminator to assign a continuous reward to each token. The discriminator is trained on token-level preference labels that an external, mature language model derives from sequence-level human preference data.
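A minimal sketch of the corresponding training objective is shown below, assuming the external LLM's judgments have already been converted into per-token labels (1 for positive, 0 for negative, -100 for prompt or padding positions). These label conventions are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def discriminator_loss(logits: torch.Tensor,
                       token_labels: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy for the preference discriminator.

    logits:       (batch, seq_len, 2) discriminator scores per token.
    token_labels: (batch, seq_len) with 1 for tokens the external LLM marked
                  as positive, 0 for negative, and -100 for prompt/padding
                  positions, which are excluded from the loss.
    """
    return F.cross_entropy(
        logits.reshape(-1, 2),
        token_labels.reshape(-1),
        ignore_index=-100,
    )
```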
The paper evaluates TLCR on open-ended generation benchmarks and shows that it consistently outperforms prior sequence-level and token-level discrete reward baselines. It also demonstrates that TLCR provides more accurate and fine-grained feedback than these traditional reward schemes.
The paper also discusses the limitations of TLCR, including the need for a large dataset and the potential for bias in the token-level preference discriminator. It concludes that TLCR offers a more effective and efficient approach to fine-grained RLHF compared to traditional methods.