This paper introduces *RealMistake*, a benchmark for evaluating error detection methods on responses from Large Language Models (LLMs). The benchmark consists of three challenging and meaningful tasks designed to elicit objective, realistic, and diverse errors in four categories: reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge. The errors are made naturally by GPT-4 and Llama 2 70B on these tasks and are annotated by experts. The paper evaluates 12 LLMs (7 open-source and 5 closed-source models) on *RealMistake* and finds that top LLMs such as GPT-4 and Claude 3 perform very poorly at detecting errors in their own responses, with low recall. Moreover, the explanations provided by LLM-based error detectors are unreliable, and popular approaches to improving LLMs, such as self-consistency and majority voting, do not significantly enhance error detection performance. The benchmark and code are available at https://github.com/psunlpgroup/RealMistake.