Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

28 May 2024 | Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, Chang Zhou
The paper addresses the challenge of aligning Large Language Models (LLMs) with human-centric values while preserving the abilities acquired through pre-training and supervised fine-tuning (SFT). It introduces the Online Merging Optimizer, which merges the RLHF policy with the SFT model at every optimization step to continuously regulate the training direction. By merging each gradient update with the parameter difference between the SFT and pre-trained models, the optimizer steers training toward maximizing rewards while staying close to the SFT model. The method is shown to work across LLM families, model sizes, RLHF algorithms, and existing model-merging methods. Extensive experiments show that the Online Merging Optimizer significantly boosts alignment rewards while mitigating the alignment tax, achieving higher overall performance across 14 benchmarks. The paper also discusses limitations and potential applications of the method in other domains affected by catastrophic forgetting.
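The core idea can be sketched as a custom optimizer step. The snippet below is a minimal illustration under stated assumptions, not the paper's algorithm: it uses a simple weighted average of the reward-maximizing gradient step and the SFT delta in place of the paper's actual merging operation, and the names (OnlineMergingSGD, merge_weight, sft_params, pre_params) are hypothetical.

```python
import torch


class OnlineMergingSGD(torch.optim.Optimizer):
    """Plain SGD whose per-step update is merged with the SFT delta
    (theta_sft - theta_pretrained), sketching the online-merging idea above."""

    def __init__(self, params, sft_params, pre_params, lr=1e-4, merge_weight=0.5):
        params = list(params)
        defaults = dict(lr=lr, merge_weight=merge_weight)
        super().__init__(params, defaults)
        # Direction from the pre-trained model toward the SFT model,
        # one tensor per trainable parameter (order must match `params`).
        self.sft_deltas = [s.detach() - p0.detach()
                           for s, p0 in zip(sft_params, pre_params)]

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        # Sketch assumes a single parameter group aligned with self.sft_deltas.
        group = self.param_groups[0]
        lr, lam = group["lr"], group["merge_weight"]
        for p, sft_delta in zip(group["params"], self.sft_deltas):
            if p.grad is None:
                continue
            rl_step = -lr * p.grad        # reward-maximizing direction
            sft_step = lr * sft_delta     # lr-scaled step along the SFT delta
            p.add_((1.0 - lam) * rl_step + lam * sft_step)  # merged online update
        return loss
```

In use, sft_params and pre_params would come from the frozen SFT and pre-trained checkpoints (e.g., sft_model.parameters() and base_model.parameters()), while params are the trainable RLHF policy weights; merge_weight trades off reward maximization against staying near the SFT model.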