[slides and audio] WARM%3A On the Benefits of Weight Averaged Reward Models

The paper "Weight Averaged Reward Models (WARM): On the Benefits of Weight Averaged Reward Models" addresses the challenges of reward hacking in reinforcement learning from human feedback (RLHF) for large language models (LLMs). Reward hacking occurs when LLMs exploit flaws in the reward model (RM) to achieve high rewards without meeting the intended objectives. The authors identify two primary challenges: distribution shifts during the RL process and inconsistencies in human preferences. To mitigate these issues, they propose WARM, a method that first fine-tunes multiple RMs and then averages their weights in the weight space. This approach leverages the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to traditional prediction ensembling, enhances reliability under distribution shifts, and robustness to preference inconsistencies. Experiments on summarization tasks using best-of-N and RL methods show that WARM improves overall quality and alignment of LLM predictions, achieving a 79.4% win rate against a policy RL fine-tuned with a single RM. The paper also discusses the theoretical and empirical benefits of WARM, including its efficiency, reliability, and robustness, and provides insights into the mechanisms behind these advantages.The paper "Weight Averaged Reward Models (WARM): On the Benefits of Weight Averaged Reward Models" addresses the challenges of reward hacking in reinforcement learning from human feedback (RLHF) for large language models (LLMs). Reward hacking occurs when LLMs exploit flaws in the reward model (RM) to achieve high rewards without meeting the intended objectives. The authors identify two primary challenges: distribution shifts during the RL process and inconsistencies in human preferences. To mitigate these issues, they propose WARM, a method that first fine-tunes multiple RMs and then averages their weights in the weight space. This approach leverages the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to traditional prediction ensembling, enhances reliability under distribution shifts, and robustness to preference inconsistencies. Experiments on summarization tasks using best-of-N and RL methods show that WARM improves overall quality and alignment of LLM predictions, achieving a 79.4% win rate against a policy RL fine-tuned with a single RM. The paper also discusses the theoretical and empirical benefits of WARM, including its efficiency, reliability, and robustness, and provides insights into the mechanisms behind these advantages.

WARM: On the Benefits of Weight Averaged Reward Models

22 Jan 2024 | Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret