Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts


2024 | Haoxiang Wang*, Wei Xiong*, Tengyang Xie, Han Zhao, Tong Zhang
This paper introduces a two-stage approach for building interpretable reward models (RMs) for Reinforcement Learning from Human Feedback (RLHF). The first stage trains an Absolute-Rating Multi-Objective Reward Model (ArmoRM) on multi-dimensional absolute-rating data, where each dimension corresponds to a human-interpretable objective such as honesty, verbosity, or safety. The second stage employs a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives for the given context. The reward model is built on Llama-3 8B, with a shallow MLP serving as the gating network.

ArmoRM-Llama3-8B achieves state-of-the-art performance on RewardBench, surpassing the LLM-as-a-judge method with GPT-4 judges and approaching the performance of the much larger Nemotron-4 340B reward model. The model's interpretability lets humans inspect the RM's internal decision process, helping to keep it aligned with human preferences and to prevent reward hacking. The approach addresses a key limitation of traditional RMs, which are typically black-box models without human-interpretable explanations. By combining multiple objectives through a context-dependent gating mechanism, the model can adapt to different contexts and reduce biases such as verbosity bias. Extensive experiments validate the method, demonstrating its effectiveness in improving reward models for language modeling.
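To make the two-stage design concrete, here is a minimal PyTorch sketch of the architecture described above, not the authors' implementation: a linear regression head maps the backbone's embedding of a (prompt, response) pair to per-objective scores, and a shallow MLP gating network conditioned on the prompt produces non-negative weights that combine them into a single scalar reward. The class name ArmoRMSketch, the gating layer sizes, and the dummy embeddings standing in for Llama-3 8B hidden states are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ArmoRMSketch(nn.Module):
    """Illustrative sketch of the ArmoRM + MoE-gating idea (not the released model)."""

    def __init__(self, hidden_dim: int = 4096, num_objectives: int = 19):
        super().__init__()
        # Stage 1: multi-objective regression head trained on absolute-rating data.
        self.reward_head = nn.Linear(hidden_dim, num_objectives)
        # Stage 2: shallow MLP gating network over prompt-only features
        # (layer widths here are assumptions, not the paper's exact configuration).
        self.gating = nn.Sequential(
            nn.Linear(hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_objectives),
        )

    def forward(self, prompt_emb: torch.Tensor, prompt_response_emb: torch.Tensor) -> torch.Tensor:
        # Per-objective rewards, e.g. honesty, verbosity, safety, ...
        objective_rewards = self.reward_head(prompt_response_emb)  # (batch, k)
        # Context-dependent mixing weights; softmax keeps them non-negative and summing to 1.
        weights = torch.softmax(self.gating(prompt_emb), dim=-1)   # (batch, k)
        # Final scalar reward is the weighted sum of the objective rewards.
        return (weights * objective_rewards).sum(dim=-1)           # (batch,)


# Usage with random tensors standing in for Llama-3 8B last-token hidden states.
model = ArmoRMSketch()
prompt_emb = torch.randn(2, 4096)
prompt_response_emb = torch.randn(2, 4096)
scalar_reward = model(prompt_emb, prompt_response_emb)
print(scalar_reward.shape)  # torch.Size([2])
```

Because the gating weights depend only on the prompt and the objective scores remain exposed, one can read off which objectives drove a given reward, which is the source of the interpretability claimed above.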