Evaluating LLMs at Detecting Errors in LLM Responses

27 Jul 2024 | Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang
This paper introduces *RealMistake*, a benchmark for evaluating error detection methods on responses from Large Language Models (LLMs). The benchmark consists of three challenging and meaningful tasks that introduce objective, realistic, and diverse errors in four categories: reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge. The tasks are designed to elicit naturally occurring, diverse errors in responses from GPT-4 and Llama 2 70B, which are annotated by experts. Using *RealMistake*, the paper evaluates 12 LLMs (7 open-source and 5 closed-source) and finds that top models such as GPT-4 and Claude 3 perform very poorly at detecting errors in their own responses, with low recall; that the explanations produced by LLM-based error detectors are unreliable; and that popular techniques for improving LLMs, such as self-consistency and majority voting, do not meaningfully improve error detection performance. The benchmark and code are available at https://github.com/psunlpgroup/RealMistake.
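To make the "low recall" finding concrete, the sketch below shows one way a binary error detector could be scored against expert annotations, reporting precision and recall on the "error" class. This is a minimal illustration under assumed conventions, not the authors' released evaluation code; field names such as `has_error` and `predicted_error` are hypothetical placeholders.

```python
# Minimal sketch (assumed data layout, not the official RealMistake evaluation code):
# each example pairs an expert annotation with a detector's binary prediction.
from typing import Dict, List


def precision_recall(examples: List[Dict[str, bool]]) -> Dict[str, float]:
    """Compute precision and recall for the 'error' class.

    Each example maps:
      "has_error"       -> expert annotation (True if the LLM response contains an error)
      "predicted_error" -> the error detector's binary prediction
    """
    tp = sum(e["has_error"] and e["predicted_error"] for e in examples)
    fp = sum((not e["has_error"]) and e["predicted_error"] for e in examples)
    fn = sum(e["has_error"] and (not e["predicted_error"]) for e in examples)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # low recall = many missed errors
    return {"precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy predictions illustrating the reported failure mode: the detector
    # flags very few responses, so recall on true errors stays low.
    toy = [
        {"has_error": True, "predicted_error": False},
        {"has_error": True, "predicted_error": True},
        {"has_error": False, "predicted_error": False},
        {"has_error": True, "predicted_error": False},
    ]
    print(precision_recall(toy))  # {'precision': 1.0, 'recall': 0.333...}
```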