5 Jun 2024 | Rafael Rafailov*, Yaswanth Chittepu*, Ryan Park*, Harshit Sikchi*, Joey Hejna*, W. Bradley Knox, Chelsea Finn, Scott Niekum
This paper investigates reward over-optimization in Direct Alignment Algorithms (DAAs), which are alternatives to the classical Reinforcement Learning from Human Feedback (RLHF) pipeline. DAAs bypass the explicit reward modeling phase by optimizing the language model's policy directly on human preference data. Despite not using a separate reward model, DAAs still exhibit over-optimization trends similar to those observed in traditional RLHF methods: performance degrades as the KL-divergence budget increases, often before a single epoch of training is complete. Through extensive empirical experiments, the authors formalize the reward over-optimization problem for DAAs and study its consequences across different objectives, training regimes, and model scales. They also derive scaling laws for the phenomenon, showing that the relationship between KL-divergence and performance mirrors earlier findings for RLHF. The study highlights the challenges over-optimization poses for DAAs, including the risk of reward hacking and the need for more robust training methods, and concludes that while DAAs offer a promising alternative to RLHF, they still face significant challenges in avoiding over-optimization and staying aligned with human preferences.
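As a concrete reference for what "bypassing the reward modeling phase" means, the sketch below contrasts the standard KL-regularized RLHF objective with the DPO loss, a representative DAA; the notation (policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, reward model $r$, KL coefficient $\beta$, preference pair $(y_w, y_l)$) is standard in this literature but is supplied here as an assumption rather than quoted from the summary above.

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]$$

$$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

The first objective requires a fitted reward model $r$ and an RL optimizer; the second trains the policy directly on preference pairs, with the implicit reward $\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ playing the role of $r$, which is why over-optimization can still arise even though no separate reward model is trained.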
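On the scaling laws, the summary only states that the KL-performance relationship aligns with previous RLHF findings. As a hedged reference point, the functional family from Gao et al.'s RLHF over-optimization study is shown below; I am assuming this is the relevant comparison, and the exact parameterization fitted for DAAs is not given in the summary above. Performance is written as a function of the square-root KL budget $d$:

$$d \;:=\; \sqrt{\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\big\|\,\pi_{\mathrm{ref}}\big]}, \qquad R_{\mathrm{BoN}}(d) \;=\; d\,(\alpha_{\mathrm{BoN}} - \beta_{\mathrm{BoN}}\, d), \qquad R_{\mathrm{RL}}(d) \;=\; d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}}\log d)$$

In Gao et al.'s setup, $R$ is the gold-reward improvement over the reference policy and $\alpha$, $\beta$ are fitted coefficients; both forms rise, peak, and then decline as $d$ grows, which is the same hump-shaped over-optimization pattern the paper reports for DAAs.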