29 Jan 2024 | Banghua Zhu, Michael I. Jordan, and Jiantao Jiao
This paper addresses two issues in Reinforcement Learning from Human Feedback (RLHF): reward overfitting and reward overoptimization. RLHF aligns language models with human values by first fitting a reward model to human preference data and then fine-tuning the language model against that reward. In practice, however, the reward model's held-out performance degrades after roughly one epoch of training (overfitting), and optimizing the policy too aggressively against the learned reward yields suboptimal policies (overoptimization).
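For concreteness, the reward-learning step is typically implemented as a pairwise (Bradley-Terry) cross-entropy loss over comparison data. The sketch below is a generic illustration of that loss, not code from the paper; the function name and example scores are hypothetical.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Standard Bradley-Terry cross-entropy loss for reward learning:
    mean of -log sigmoid(r_chosen - r_rejected) over all comparisons."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-margin))))

# Hypothetical reward scores assigned to preferred / dispreferred responses.
print(pairwise_reward_loss([1.2, 0.3], [0.4, 0.9]))
```

Minimizing this loss with hard 0/1 preference labels is the baseline that the paper's method modifies.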
The authors propose an improved reward-learning algorithm called Iterative Data Smoothing (IDS). The core idea is to update the model with the data while also updating the data with the model, replacing hard preference labels with soft labels that reflect the model's current predictions. This mitigates both reward overfitting and overoptimization by reducing the influence of imbalanced comparison data and preventing the model from memorizing the training labels.
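A minimal sketch of this idea in a K-armed bandit setting is shown below: each epoch alternates a gradient step on the reward estimates with a step that moves each label toward the model's predicted preference probability. The function name, learning rates, and the convex-combination label update are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterative_data_smoothing(pairs, labels, K, epochs=200, lr_theta=0.1, lr_label=0.05):
    """Sketch of the IDS idea for a K-armed bandit.
    pairs[i] = (a, b) is a compared arm pair; labels[i] is the (initially hard)
    probability that arm a was preferred. Each epoch: (1) gradient step on the
    reward estimates theta under the current soft labels, (2) move the labels
    toward the model's own predicted preference probabilities."""
    theta = np.zeros(K)
    y = np.asarray(labels, dtype=float)
    for _ in range(epochs):
        # (1) model update: gradient of the soft-label cross-entropy w.r.t. theta
        grad = np.zeros(K)
        for (a, b), yi in zip(pairs, y):
            p = sigmoid(theta[a] - theta[b])
            grad[a] += (p - yi)
            grad[b] -= (p - yi)
        theta -= lr_theta * grad / len(pairs)
        # (2) data update: smooth the labels toward the model's predictions
        preds = np.array([sigmoid(theta[a] - theta[b]) for a, b in pairs])
        y = (1 - lr_label) * y + lr_label * preds
    return theta, y

# Example usage: 3 arms, two comparisons, hard initial labels.
theta, soft_labels = iterative_data_smoothing([(0, 1), (2, 1)], [1.0, 0.0], K=3)
```

Setting `lr_label = 0` recovers ordinary hard-label training, which makes the smoothing step easy to ablate.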
The paper analyzes the problem in a multi-armed bandit framework, showing that minimizing the standard empirical cross-entropy loss overfits when the comparison data is imbalanced: arms that appear in only a few comparisons can receive badly inflated (or deflated) reward estimates. IDS addresses this by iteratively updating the preference labels using the model's predictions, which helps the learned rewards converge toward the true ones.
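To see the failure mode concretely, consider a single pair of arms compared exactly once, with the winner recorded as a hard label of 1: the empirical cross-entropy is driven down only as the estimated reward gap diverges, whereas smoothing the label toward the model's own prediction makes the gap plateau. The toy simulation below (with assumed learning rates, in the spirit of the sketch above) illustrates the contrast.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_gap(epochs=2000, lr_theta=0.1, lr_label=0.0):
    """Fit the reward gap for one arm pair compared once, where the recorded
    (hard) label says the first arm won. lr_label = 0 is plain hard-label
    training; lr_label > 0 smooths the label toward the model's prediction,
    in the spirit of IDS."""
    gap, y = 0.0, 1.0                          # estimated reward gap, current label
    for _ in range(epochs):
        p = sigmoid(gap)                       # predicted preference probability
        gap += lr_theta * (y - p)              # gradient step on the reward gap
        y = (1 - lr_label) * y + lr_label * p  # label update (no-op when lr_label = 0)
    return gap

print("hard labels:      gap =", round(fit_gap(lr_label=0.0), 2))   # keeps growing with more epochs
print("IDS-style labels: gap =", round(fit_gap(lr_label=0.05), 2))  # plateaus at a modest value
```

Under hard labels the greedy policy can therefore latch onto a rarely compared arm; smoothing the label keeps the estimate for such arms bounded.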
Empirical results show that IDS outperforms standard hard-label reward training in both the bandit and neural-network settings. By adjusting the preference labels during training, the algorithm mitigates overfitting and overoptimization, leading to better policy learning. The paper also discusses related work, including knowledge distillation and preference-based reinforcement learning, and highlights the importance of soft labels for improving model performance. Overall, the method is shown to be effective in both theoretical and practical settings, providing a robust remedy for these challenges in RLHF.