24 Jun 2024 | Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
WARP (Weight Averaged Rewarded Policies) is a novel alignment strategy for large language models (LLMs) that improves the KL-reward Pareto front, i.e., the trade-off between maximizing reward and staying close to the pretrained model. The method applies three variants of weight averaging in sequence: an exponential moving average (EMA) of the policy serves as a dynamic anchor for the KL regularization during RL fine-tuning; spherical linear interpolation (SLERP) merges independently fine-tuned policies; and linear interpolation towards the initialization pulls the merged model back towards its starting point. These stages are applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front. Experiments with Gemma policies show that WARP improves their quality and alignment, outperforming other open-source LLMs. The method mitigates common challenges in reinforcement learning from human feedback (RLHF), such as reward hacking, catastrophic forgetting, and reduced diversity in generations. By leveraging model merging, WARP balances KL regularization against reward optimization, leading to better alignment and performance. The approach is flexible and scalable, lending itself to distributed learning and open-source collaboration, and contributes to safer and more capable AI systems by enabling effective alignment through iterative improvements.
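To make the three weight-averaging stages concrete, here is a minimal sketch in Python/NumPy operating on flattened parameter vectors. It is not the paper's implementation: the function names (ema_update, slerp, liti), the hyperparameter values, and the toy "fine-tuned" policies are illustrative assumptions; in WARP the EMA anchor is updated alongside actual RL training, and SLERP is applied to the task vectors (deltas from the shared initialization) of each independently trained policy.

```python
import numpy as np

def ema_update(anchor, policy, mu=0.01):
    """Stage 1 (sketch): exponential moving average of the trained policy.
    The EMA weights act as a slowly moving anchor for the KL penalty."""
    return (1.0 - mu) * anchor + mu * policy

def slerp(init, theta_a, theta_b, lam=0.5, eps=1e-12):
    """Stage 2 (sketch): spherical linear interpolation of two fine-tuned
    policies, applied to their task vectors (deltas from the shared init)."""
    da, db = theta_a - init, theta_b - init
    cos_omega = np.dot(da, db) / (np.linalg.norm(da) * np.linalg.norm(db) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:  # nearly parallel task vectors: fall back to a linear mix
        return init + (1.0 - lam) * da + lam * db
    s = np.sin(omega)
    return init + (np.sin((1.0 - lam) * omega) / s) * da \
                + (np.sin(lam * omega) / s) * db

def liti(init, merged, eta=0.5):
    """Stage 3 (sketch): linear interpolation towards the initialization,
    trading some reward for a lower KL from the starting model."""
    return (1.0 - eta) * init + eta * merged

# One illustrative WARP-style iteration on toy weight vectors.
rng = np.random.default_rng(0)
theta_init = rng.normal(size=1000)

# Stand-ins for two policies independently RL-fine-tuned from theta_init;
# the random walk replaces real policy-gradient steps, EMA anchoring shown.
policies = []
for _ in range(2):
    policy, anchor = theta_init.copy(), theta_init.copy()
    for _ in range(100):
        policy = policy + rng.normal(scale=0.01, size=1000)  # mock RL update
        anchor = ema_update(anchor, policy, mu=0.01)          # KL anchor drifts slowly
    policies.append(policy)

theta_merged = slerp(theta_init, policies[0], policies[1], lam=0.5)
theta_next_init = liti(theta_init, theta_merged, eta=0.5)  # init for the next iteration
```

Under these assumptions, lam=0.5 merges the two policies symmetrically, and eta sweeps the KL-reward trade-off: eta=0 recovers the initialization, eta=1 keeps the merged rewarded policy, and intermediate values trace the improved Pareto front that each WARP iteration then uses as its new starting point.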