Evaluating LLMs at Detecting Errors in LLM Responses


2024 | Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang
This paper introduces ReaLMistake, the first benchmark for evaluating error detection on responses from large language models (LLMs). The benchmark contains 900 instances of binary error annotations on responses from GPT-4 and Llama 2 70B, covering four categories of errors: reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge. It was built to provide objective, realistic, and diverse errors that human annotators can assess, and its three tasks are designed to elicit errors that are challenging for LLMs yet feasible for humans to annotate.

The paper evaluates 12 LLMs as error detectors on ReaLMistake. Even top models such as GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform far worse than humans. The explanations these detectors produce are unreliable, and open-source LLMs often give wrong reasoning even when their binary predictions are correct. Error detection performance is sensitive to small changes in the prompt yet remains difficult to improve: popular techniques for improving LLMs, including self-consistency and majority voting, do not help.

The paper also discusses the challenges of creating error detection benchmarks, in particular the lack of benchmarks with binary error annotations on objective, realistic, and diverse errors made by LLMs.
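To make the evaluation setup concrete, below is a minimal sketch of how an LLM can be used as a binary error detector and scored with precision and recall of the "error" class, along with a simple majority-vote variant of the kind the paper reports as unhelpful. This is an illustrative sketch, not the paper's actual code: the `query_llm` stub, the prompt wording, and the instance fields (`task_input`, `llm_response`, `has_error`) are assumptions standing in for whatever LLM client and data schema you use.

```python
"""Minimal sketch of an LLM-based error detector in the style of ReaLMistake.

Hypothetical pieces: `query_llm` stands in for an arbitrary LLM API call, and
the instance fields (`task_input`, `llm_response`, `has_error`) are
illustrative, not the benchmark's actual schema.
"""
from collections import Counter

DETECTOR_PROMPT = """You are given a task instruction and a response written by an AI model.
Decide whether the response contains any error (reasoning, instruction-following,
context-faithfulness, or knowledge). Answer with a single word: "error" or "no_error".

# Task instruction
{task_input}

# Model response
{llm_response}

# Your judgment:"""


def query_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for an LLM API call; replace with a real client."""
    raise NotImplementedError("plug in your LLM client here")


def detect_error(instance: dict) -> bool:
    """Return True if the detector predicts the response contains an error."""
    prompt = DETECTOR_PROMPT.format(
        task_input=instance["task_input"],
        llm_response=instance["llm_response"],
    )
    judgment = query_llm(prompt).strip().lower()
    return judgment.startswith("error")


def detect_error_majority(instance: dict, n_samples: int = 5) -> bool:
    """Majority vote over several sampled judgments (the paper finds this
    style of ensembling does not improve error detection)."""
    prompt = DETECTOR_PROMPT.format(
        task_input=instance["task_input"],
        llm_response=instance["llm_response"],
    )
    votes = [
        query_llm(prompt, temperature=1.0).strip().lower().startswith("error")
        for _ in range(n_samples)
    ]
    return Counter(votes).most_common(1)[0][0]


def precision_recall(instances: list[dict], predictions: list[bool]) -> tuple[float, float]:
    """Precision and recall of the 'error' class against gold binary labels."""
    tp = sum(1 for inst, pred in zip(instances, predictions) if pred and inst["has_error"])
    pred_pos = sum(predictions)
    gold_pos = sum(1 for inst in instances if inst["has_error"])
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall
```

The low recall reported in the paper corresponds to detectors of this kind predicting "no_error" for many responses that human annotators marked as erroneous.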
The paper highlights the importance of designing tasks that make LLMs introduce errors similar to those seen in real-world applications: the tasks should be hard enough that strong LLMs produce errors, yet not so difficult that humans cannot annotate them and analyze error detection methods in detail. The authors conclude that ReaLMistake provides challenging and diverse error detection tasks, and that further research is needed to improve LLM-based error detectors for LLM responses. The benchmark and code are available at https://github.com/psunlpgroup/ReaLMistake.