28 May 2024 | Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, Chang Zhou
This paper introduces Online Merging Optimizers to address the alignment tax in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). The alignment tax is the trade-off between aligning a model with human preferences and preserving its original capabilities. The proposed method merges the RL policy and the Supervised Fine-tuning (SFT) model at every optimization step, continuously adjusting the training direction: each gradient update is merged with the parameter difference between the SFT and pre-trained models, steering the policy toward higher reward while staying along the SFT optimization direction. This balance between reward maximization and alignment tax yields higher overall performance across 14 benchmarks.
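To make the core idea concrete, here is a minimal per-parameter sketch in PyTorch. The simple weighted-sum rule, the function name, and the `merge_weight` default are assumptions for illustration only; the paper's actual merging rules (OnDARE/OnTIES) are sparsification-based and are sketched after the next paragraph.

```python
import torch

def merged_update(param: torch.Tensor,
                  rl_update: torch.Tensor,
                  theta_sft: torch.Tensor,
                  theta_pretrain: torch.Tensor,
                  merge_weight: float = 0.1) -> None:
    """Apply one online-merged step: combine the RLHF optimizer's proposed
    update with the SFT delta (theta_sft - theta_pretrain), so the policy
    keeps moving along the SFT optimization direction while chasing reward.
    The weighted sum and the merge_weight value are illustrative assumptions,
    not the paper's reference rule."""
    delta_sft = theta_sft - theta_pretrain            # direction learned during SFT
    step = (1.0 - merge_weight) * rl_update + merge_weight * delta_sft
    param.data.add_(step)                             # in-place parameter update
```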
The Online Merging Optimizer is instantiated as OnDARE and OnTIES, which adapt the existing DARE and TIES model-merging methods to per-step optimization. These optimizers are evaluated on multiple LLM families, including Qwen and LLaMA, across different model sizes and RLHF algorithms. The results show that Online Merging Optimizers substantially increase alignment reward while mitigating the alignment tax, outperforming standard optimizers such as AdamW as well as existing merging methods. The approach also reduces catastrophic forgetting, a common issue in continual learning, and is compatible with a range of RLHF algorithms.
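The sketch below illustrates what a DARE-style drop-and-rescale merge could look like when applied online to the per-step update; the function names, default values, and the weighted combination are assumptions for illustration, not the paper's reference implementation.

```python
import torch

def dare_sparsify(delta: torch.Tensor, reserve_rate: float) -> torch.Tensor:
    """Keep each element of `delta` with probability `reserve_rate` (the
    parameter reserve rate), zero out the rest, and rescale survivors by
    1/reserve_rate so the expected value of the delta is preserved."""
    mask = (torch.rand_like(delta) < reserve_rate).to(delta.dtype)
    return delta * mask / reserve_rate

def ondare_style_merge(rl_update: torch.Tensor,
                       delta_sft: torch.Tensor,
                       reserve_rate: float = 0.3,
                       merge_weight: float = 0.1) -> torch.Tensor:
    """Sparsify both the per-step RL update and the SFT delta, then combine
    them with an assumed weighted sum (a TIES-style variant would instead
    trim by magnitude and resolve sign conflicts before merging)."""
    return ((1.0 - merge_weight) * dare_sparsify(rl_update, reserve_rate)
            + merge_weight * dare_sparsify(delta_sft, reserve_rate))
```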
The paper also discusses the impact of hyperparameters such as the parameter reserve rate and merging weight on the performance of the Online Merging Optimizer. It further analyzes the complementary effects of KL constraints and online merging, showing that the combination of these techniques can lead to better alignment performance. The study concludes that Online Merging Optimizers are a promising approach for improving LLM alignment, reducing the alignment tax, and enhancing overall performance in RLHF training.
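To show how these pieces could interact, the schematic step below combines a KL penalty (acting on the policy's output distribution) with online merging (acting on the parameter update), reusing `ondare_style_merge` from the sketch above. The simplified policy-gradient loss, function names, and coefficient defaults are assumptions for illustration, not the paper's training setup.

```python
import torch

def rlhf_step(policy, optimizer, logp_policy, logp_sft, reward,
              theta_sft, theta_pretrain,
              kl_coef=0.1, reserve_rate=0.3, merge_weight=0.1):
    """Schematic RLHF step: a KL constraint in the objective keeps outputs
    close to the SFT policy, while online merging keeps parameter updates
    close to the SFT direction. All defaults are assumed for the sketch."""
    # KL-regularized objective: maximize reward while staying near SFT outputs.
    kl = (logp_policy - logp_sft).mean()
    loss = -(reward.detach() * logp_policy).mean() + kl_coef * kl
    loss.backward()

    lr = optimizer.param_groups[0]["lr"]
    with torch.no_grad():
        for p, p_sft, p_pre in zip(policy.parameters(), theta_sft, theta_pretrain):
            if p.grad is None:
                continue
            rl_update = -lr * p.grad                  # plain gradient step for clarity
            p.add_(ondare_style_merge(rl_update, p_sft - p_pre,
                                      reserve_rate=reserve_rate,
                                      merge_weight=merge_weight))
    optimizer.zero_grad()
```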