WARM: On the Benefits of Weight Averaged Reward Models

22 Jan 2024 | Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
This paper introduces Weight Averaged Reward Models (WARM), an approach for improving the reliability and robustness of reward models (RMs) used in reinforcement learning from human feedback (RLHF). The two main challenges in reward modeling are distribution shifts during the RL process and inconsistencies in human preferences. WARM addresses both by fine-tuning multiple RMs from a shared pre-trained initialization and averaging their weights. The method builds on the observation that such fine-tuned weights remain linearly mode connected, so their average behaves like a single, more reliable RM: it holds up better under distribution shifts and is less sensitive to preference inconsistencies. Experiments on summarization tasks show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM achieves a 79.4% win rate against a policy RL fine-tuned with a single RM.

The paper also examines existing approaches such as prediction ensembling (ENS), which averages the outputs of several RMs but must keep all of them at inference time and remains sensitive to label noise. WARM is shown to be more efficient and more robust: it maintains a single model at inference time and reduces the risk of reward hacking. The analysis ties these benefits to linear mode connectivity (LMC) and weight averaging (WA), and shows that WARM is more robust to label corruption because averaging selects the predictive mechanisms that are invariant across runs, diminishing the memorization of corrupted samples.

Experiments on summarization demonstrate that WARM improves performance without any memory or inference overhead and mitigates reward hacking, leading to better downstream policies. The paper concludes that WARM is a promising approach for aligning LLMs with human preferences and for the safe deployment of AI systems.
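To make the merging step concrete, below is a minimal PyTorch-style sketch (not the paper's code) of the two strategies discussed above: WARM-style uniform weight averaging of M reward models fine-tuned from the same pre-trained initialization, and prediction ensembling (ENS), which instead averages the M scalar rewards at inference time. The `reward_models` list, the `warm_average` and `ens_reward` helpers, and the assumption that each model exposes a standard `state_dict()` are illustrative choices, not details from the paper.

```python
import copy
import torch


def warm_average(reward_models):
    """WARM-style merge: uniformly average the weights of M reward models
    fine-tuned from the same pre-trained initialization. Linear mode
    connectivity is what makes this parameter-space average meaningful."""
    merged = copy.deepcopy(reward_models[0])
    merged_state = merged.state_dict()
    states = [rm.state_dict() for rm in reward_models]
    for name, tensor in merged_state.items():
        if tensor.is_floating_point():  # skip integer buffers, if any
            merged_state[name] = torch.stack(
                [s[name] for s in states], dim=0
            ).mean(dim=0)
    merged.load_state_dict(merged_state)
    return merged  # a single RM: no extra memory or inference cost


def ens_reward(reward_models, inputs):
    """Prediction ensembling (ENS): average the scalar rewards of M models.
    Requires M forward passes and M models in memory at inference time."""
    with torch.no_grad():
        return torch.stack(
            [rm(inputs) for rm in reward_models], dim=0
        ).mean(dim=0)
```

In RLHF, the single merged model returned by `warm_average` would then replace the individual RMs when scoring policy samples, which is how WARM keeps the cost of one reward model while retaining the reliability benefits usually associated with ensembles.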