[slides] Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

This paper introduces FBI, a novel framework for evaluating the proficiency of Evaluator LLMs in assessing four critical abilities: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. The framework uses targeted perturbations to test whether Evaluator LLMs can detect quality drops in generated answers. By creating 2400 perturbed answers across 22 perturbation categories, the authors conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators. Their findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. The code and data are available at https://github.com/AI4Bharat/FBI.
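
The perturbation test described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `undetected_rate`, the toy evaluator `toy_length_score`, and the detection rule (a perturbation counts as detected only if the perturbed answer scores strictly lower than the original) are all assumptions made for illustration. In practice the scoring function would wrap a call to an Evaluator LLM.

```python
from typing import Callable, Iterable, Tuple


def undetected_rate(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (original, perturbed) pairs the evaluator fails to penalize.

    Assumed detection rule: the perturbation is detected only when the
    perturbed answer receives a strictly lower score than the original.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (original, perturbed) pair")
    missed = sum(
        1 for original, perturbed in pairs if score(perturbed) >= score(original)
    )
    return missed / len(pairs)


def toy_length_score(answer: str) -> float:
    """Hypothetical stand-in for an Evaluator LLM: scores by word count only."""
    return float(len(answer.split()))


# Two hand-written pairs, purely illustrative (not taken from the FBI dataset).
pairs = [
    # Factual perturbation with unchanged length: the toy evaluator misses it.
    ("Paris is the capital of France.", "Lyon is the capital of France."),
    # Truncated answer: the toy evaluator scores it lower, so it is detected.
    ("Water boils at 100 degrees Celsius at sea level.", "Water boils."),
]

rate = undetected_rate(pairs, toy_length_score)
print(f"undetected perturbations: {rate:.0%}")
```

A length-based scorer is used here only because it makes the blind spot obvious: it cannot see a factual error that leaves the word count unchanged.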

19 Jun 2024 | Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra