8 Jun 2024 | Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
The paper introduces REWARDBENCH, a benchmark dataset and codebase for evaluating reward models (RMs) used in Reinforcement Learning from Human Feedback (RLHF). The dataset consists of prompt-chosen-rejected trios spanning chat, reasoning, and safety tasks, enabling evaluation of how RMs handle challenging and out-of-distribution queries. Many comparisons are built around subtle but verifiable reasons to prefer one completion over the other, such as a bug in code or an incorrect fact. The REWARDBENCH leaderboard covers RMs trained with a variety of methods, including classifiers trained directly with maximum likelihood estimation (MLE) and implicit reward models obtained via Direct Preference Optimization (DPO). The paper reports findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various RMs, with the aim of improving understanding of the RLHF process.
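For context, the two reward formulations behind these model classes can be written out explicitly (standard RLHF/DPO notation, introduced here for illustration rather than taken from the summary): a classifier RM r_theta is fit by maximum-likelihood estimation on a Bradley-Terry preference objective over chosen/rejected pairs, while a DPO-trained policy induces an implicit reward equal to a scaled log-ratio of policy and reference probabilities.

% Classifier RM: Bradley-Terry MLE over (prompt x, chosen y_c, rejected y_r)
\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}\!\left[ \log \sigma\!\big( r_\theta(x, y_c) - r_\theta(x, y_r) \big) \right]

% DPO: implicit reward used to score a completion y given prompt x
r_{\mathrm{DPO}}(x, y) = \beta \, \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}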
The dataset is structured into five sections: Chat, Chat Hard, Safety, Reasoning, and Prior Sets. Each section contains prompt-chosen-rejected trios, and a trio is scored as correct when the reward model assigns a higher score to the chosen completion than to the rejected one; section scores are the resulting accuracies. The REWARDBENCH leaderboard evaluates over 80 models, including those trained as classifiers and those trained with DPO. The paper highlights performance differences between DPO and classifier-based RMs, showing that DPO models often fail to generalize to popular preference data test sets and exhibit higher variance in performance.
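To make the scoring rule concrete, here is a minimal sketch of per-section accuracy; the score callable and the sample layout below are assumptions for illustration, not the REWARDBENCH implementation:

from typing import Callable, Dict, List

def section_accuracy(
    score: Callable[[str, str], float],  # assumed scalar reward: score(prompt, completion)
    samples: List[Dict[str, str]],       # each sample: {"prompt", "chosen", "rejected"}
) -> float:
    """A trio counts as correct when the chosen completion outscores the rejected one."""
    correct = sum(
        score(s["prompt"], s["chosen"]) > score(s["prompt"], s["rejected"])
        for s in samples
    )
    return correct / len(samples)

def toy_rm(prompt: str, completion: str) -> float:
    # Stand-in "reward" that simply prefers longer answers (illustration only).
    return float(len(completion))

samples = [
    {"prompt": "2+2?", "chosen": "2+2 equals 4.", "rejected": "5"},
    {"prompt": "Capital of France?", "chosen": "Paris.", "rejected": "I cannot answer that."},
]
print(section_accuracy(toy_rm, samples))  # 0.5 for this toy reward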
The paper also discusses the limitations of current preference data test sets, noting that they often have low ceilings on accuracy and high variance in performance. The results show that large models and those based on Llama 3 perform best on the Chat Hard and Reasoning sections, while smaller models show varying performance across tasks. The paper calls for further research to understand the full limitations of existing datasets and to improve RM performance on challenging instruction and reasoning tasks. The REWARDBENCH toolkit provides a common inference stack for the various model types and records per-sample text-score pairs for performance analysis. The paper concludes that REWARDBENCH is a valuable tool for evaluating the safety and performance of reward models in RLHF.
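As a rough illustration of what a common inference stack can look like, the sketch below puts classifier RMs and DPO models behind a single scoring interface; the class names and callables are hypothetical, not the toolkit's actual API:

from abc import ABC, abstractmethod
from typing import Callable

class RewardScorer(ABC):
    """Assumed common interface: every model type maps (prompt, completion) to one scalar."""

    @abstractmethod
    def score(self, prompt: str, completion: str) -> float:
        ...

class ClassifierScorer(RewardScorer):
    """Wraps a trained classifier RM that directly outputs a scalar reward."""

    def __init__(self, reward_fn: Callable[[str, str], float]):
        self.reward_fn = reward_fn

    def score(self, prompt: str, completion: str) -> float:
        return self.reward_fn(prompt, completion)

class DPOScorer(RewardScorer):
    """Scores with the DPO implicit reward: beta * (log pi_policy - log pi_ref)."""

    def __init__(
        self,
        policy_logprob: Callable[[str, str], float],  # log-prob of completion under the DPO-trained policy
        ref_logprob: Callable[[str, str], float],     # log-prob of completion under the reference model
        beta: float = 0.1,
    ):
        self.policy_logprob = policy_logprob
        self.ref_logprob = ref_logprob
        self.beta = beta

    def score(self, prompt: str, completion: str) -> float:
        return self.beta * (
            self.policy_logprob(prompt, completion) - self.ref_logprob(prompt, completion)
        )

With a single score() entry point, the accuracy computation sketched earlier applies unchanged to either model family, and per-sample text-score pairs can be logged for later analysis.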