8 Jun 2024 | Nathan Lambert, Valentina Pyatkin, Jacob Morrison, L.J. Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
The paper introduces REWARDBENCH, a benchmark dataset and codebase for evaluating reward models (RMs) used in reinforcement learning from human feedback (RLHF). RLHF is a critical but opaque step in aligning language models with human preferences, and RMs sit at its center, yet few resources or studies have focused on evaluating these models directly. REWARDBENCH addresses this gap with a common framework for evaluating RMs trained as classifiers as well as those trained with Direct Preference Optimization (DPO). The dataset consists of prompt-chosen-rejected trios spanning chat, reasoning, and safety tasks, constructed to be challenging and to differentiate RMs. The authors evaluate over 80 models, analyzing performance along axes such as scaling, reasoning capability, and instruction following. Key findings include systematic differences between DPO-based and classifier-based RMs, a map of the current state-of-the-art in reward modeling, and the limitations of existing preference-data test sets. The paper also discusses broader impacts of the benchmark, including the potential for offensive content and the need for further research correlating benchmark results with downstream RLHF training outcomes.
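The core metric behind such a benchmark is simple: given a prompt-chosen-rejected trio, a good reward model should score the chosen completion above the rejected one, and accuracy is the fraction of trios where it does. Below is a minimal sketch of that idea, not the official REWARDBENCH evaluation code; the specific checkpoint name, the `score`/`accuracy` helpers, and the example trio are assumptions for illustration. DPO-trained models are scored differently in the paper (via the implicit reward, a scaled log-probability ratio against a reference model), which this sketch does not cover.

```python
# Minimal sketch: accuracy of a classifier-style reward model on
# prompt-chosen-rejected trios. Model name and data are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example RM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def score(prompt: str, completion: str) -> float:
    """Scalar reward the classifier assigns to a (prompt, completion) pair."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()


def accuracy(trios: list[dict]) -> float:
    """Fraction of trios where the chosen completion outscores the rejected one."""
    wins = sum(
        score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
        for t in trios
    )
    return wins / len(trios)


# Hypothetical trio in the benchmark's prompt-chosen-rejected format.
trios = [
    {
        "prompt": "Explain why the sky is blue.",
        "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter most.",
        "rejected": "The sky reflects the color of the ocean.",
    }
]
print(f"accuracy: {accuracy(trios):.2f}")
```

In practice the benchmark aggregates this per-trio comparison over its chat, reasoning, and safety subsets to produce the per-category and overall scores reported for the 80+ evaluated models.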