24 Jun 2024 | Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
WARP (Weight Averaged Rewarded Policies) is a novel alignment strategy for large language models (LLMs) that improves the KL-reward Pareto front, i.e., the trade-off between maximizing reward and staying close to the pretrained model. The method applies three variants of weight averaging in sequence: an exponential moving average (EMA) of the policy serves as a dynamic anchor for the KL regularization during RL fine-tuning; spherical linear interpolation (SLERP) merges independently fine-tuned policies; and linear interpolation towards the initialization pulls the merged model back towards its starting point. These stages are applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front. Experiments with Gemma policies show that WARP improves their quality and alignment, outperforming other open-source LLMs. The method mitigates common challenges in reinforcement learning from human feedback (RLHF), such as reward hacking, catastrophic forgetting, and reduced diversity in generations. By leveraging model merging, WARP balances KL regularization against reward optimization, leading to better alignment and performance. The approach is flexible and scalable, lending itself to distributed learning and open-source collaboration, and contributes to safer and more capable AI systems by enabling effective alignment through iterative improvements.
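To make the three weight-averaging stages concrete, here is a minimal sketch in Python/NumPy operating on flattened parameter vectors. It is not the paper's implementation: the function names (ema_update, slerp, liti), the hyperparameter values, and the toy "fine-tuned" policies are illustrative assumptions; in WARP the EMA anchor is updated alongside actual RL training, and SLERP is applied to the task vectors (deltas from the shared initialization) of each independently trained policy.

```python
import numpy as np

def ema_update(anchor, policy, mu=0.01):
    """Stage 1 (sketch): exponential moving average of the trained policy.
    The EMA weights act as a slowly moving anchor for the KL penalty."""
    return (1.0 - mu) * anchor + mu * policy

def slerp(init, theta_a, theta_b, lam=0.5, eps=1e-12):
    """Stage 2 (sketch): spherical linear interpolation of two fine-tuned
    policies, applied to their task vectors (deltas from the shared init)."""
    da, db = theta_a - init, theta_b - init
    cos_omega = np.dot(da, db) / (np.linalg.norm(da) * np.linalg.norm(db) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:  # nearly parallel task vectors: fall back to a linear mix
        return init + (1.0 - lam) * da + lam * db
    s = np.sin(omega)
    return init + (np.sin((1.0 - lam) * omega) / s) * da \
                + (np.sin(lam * omega) / s) * db

def liti(init, merged, eta=0.5):
    """Stage 3 (sketch): linear interpolation towards the initialization,
    trading some reward for a lower KL from the starting model."""
    return (1.0 - eta) * init + eta * merged

# One illustrative WARP-style iteration on toy weight vectors.
rng = np.random.default_rng(0)
theta_init = rng.normal(size=1000)

# Stand-ins for two policies independently RL-fine-tuned from theta_init;
# the random walk replaces real policy-gradient steps, EMA anchoring shown.
policies = []
for _ in range(2):
    policy, anchor = theta_init.copy(), theta_init.copy()
    for _ in range(100):
        policy = policy + rng.normal(scale=0.01, size=1000)  # mock RL update
        anchor = ema_update(anchor, policy, mu=0.01)          # KL anchor drifts slowly
    policies.append(policy)

theta_merged = slerp(theta_init, policies[0], policies[1], lam=0.5)
theta_next_init = liti(theta_init, theta_merged, eta=0.5)  # init for the next iteration
```

Under these assumptions, lam=0.5 merges the two policies symmetrically, and eta sweeps the KL-reward trade-off: eta=0 recovers the initialization, eta=1 keeps the merged rewarded policy, and intermediate values trace the improved Pareto front that each WARP iteration then uses as its new starting point.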