Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts


2024 | Haoxiang Wang*, Wei Xiong*, Tengyang Xie, Han Zhao, Tong Zhang
This paper introduces a two-stage approach for building interpretable reward models (RMs) for Reinforcement Learning from Human Feedback (RLHF). The first stage trains an Absolute-Rating Multi-Objective Reward Model (ArmoRM) on multi-dimensional absolute-rating data, where each dimension corresponds to a human-interpretable objective such as honesty, verbosity, or safety. The second stage employs a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives for the given context. The reward model is built on Llama-3 8B, with a shallow MLP serving as the gating network.

ArmoRM-Llama3-8B achieves state-of-the-art performance on RewardBench, surpassing the LLM-as-a-judge method with GPT-4 judges and approaching the performance of the much larger Nemotron-4 340B reward model. The model's interpretability lets humans inspect the RM's internal decision process, helping to keep it aligned with human preferences and to prevent reward hacking. The approach addresses a key limitation of traditional RMs, which are typically black-box models without human-interpretable explanations. By combining multiple objectives through a context-dependent gating mechanism, the model can adapt to different contexts and reduce biases such as verbosity bias. Extensive experiments validate the method, demonstrating its effectiveness in improving reward models for language modeling.
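To make the two-stage design concrete, here is a minimal PyTorch sketch of the architecture described above, not the authors' implementation: a linear regression head maps the backbone's embedding of a (prompt, response) pair to per-objective scores, and a shallow MLP gating network conditioned on the prompt produces non-negative weights that combine them into a single scalar reward. The class name ArmoRMSketch, the gating layer sizes, and the dummy embeddings standing in for Llama-3 8B hidden states are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ArmoRMSketch(nn.Module):
    """Illustrative sketch of the ArmoRM + MoE-gating idea (not the released model)."""

    def __init__(self, hidden_dim: int = 4096, num_objectives: int = 19):
        super().__init__()
        # Stage 1: multi-objective regression head trained on absolute-rating data.
        self.reward_head = nn.Linear(hidden_dim, num_objectives)
        # Stage 2: shallow MLP gating network over prompt-only features
        # (layer widths here are assumptions, not the paper's exact configuration).
        self.gating = nn.Sequential(
            nn.Linear(hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_objectives),
        )

    def forward(self, prompt_emb: torch.Tensor, prompt_response_emb: torch.Tensor) -> torch.Tensor:
        # Per-objective rewards, e.g. honesty, verbosity, safety, ...
        objective_rewards = self.reward_head(prompt_response_emb)  # (batch, k)
        # Context-dependent mixing weights; softmax keeps them non-negative and summing to 1.
        weights = torch.softmax(self.gating(prompt_emb), dim=-1)   # (batch, k)
        # Final scalar reward is the weighted sum of the objective rewards.
        return (weights * objective_rewards).sum(dim=-1)           # (batch,)


# Usage with random tensors standing in for Llama-3 8B last-token hidden states.
model = ArmoRMSketch()
prompt_emb = torch.randn(2, 4096)
prompt_response_emb = torch.randn(2, 4096)
scalar_reward = model(prompt_emb, prompt_response_emb)
print(scalar_reward.shape)  # torch.Size([2])
```

Because the gating weights depend only on the prompt and the objective scores remain exposed, one can read off which objectives drove a given reward, which is the source of the interpretability claimed above.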