19 Jun 2024 | Sumanth Doddapaneni*,1,2, Mohammed Safi Ur Rahman Khan*,1,2, Sshubam Verma1, Mitesh M. Khapra1,2
This paper investigates the effectiveness of Large Language Models (LLMs) as evaluators for other LLMs, focusing on four critical abilities: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. The authors introduce FBI, a novel framework designed to assess the proficiency of Evaluator LLMs using targeted perturbations. Creating 2,400 perturbed answers across 22 categories, they conduct a comprehensive study using three evaluation paradigms: single-answer, pairwise, and reference-guided evaluation. The results reveal significant shortcomings in current Evaluator LLMs, which fail to identify quality drops in over 50% of cases. Single-answer and pairwise evaluations show notable limitations, while reference-guided evaluations perform relatively better. The findings highlight the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. The FBI framework is intended to be extended further and used for continued meta-evaluation of Evaluator LLMs.
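To make the perturbation-based meta-evaluation concrete, here is a minimal sketch of the single-answer setting: an Evaluator LLM scores an original answer and a deliberately degraded version of it, and we check whether the score drops. The `judge` callable, the `margin` threshold, and the toy data are illustrative assumptions, not the FBI implementation.

```python
# Minimal sketch of perturbation-based meta-evaluation (single-answer paradigm).
# Assumption: `judge` is any callable wrapping an Evaluator LLM that returns a
# numeric quality score for (question, answer). Names and data are hypothetical.
from typing import Callable


def detects_quality_drop(
    judge: Callable[[str, str], float],
    question: str,
    original_answer: str,
    perturbed_answer: str,
    margin: float = 0.5,
) -> bool:
    """Return True if the evaluator scores the perturbed answer lower
    than the original by at least `margin`."""
    original_score = judge(question, original_answer)
    perturbed_score = judge(question, perturbed_answer)
    return (original_score - perturbed_score) >= margin


# Toy judge that rewards longer answers (a stand-in for a real LLM call).
toy_judge = lambda q, a: min(10.0, len(a.split()) / 5)

print(detects_quality_drop(
    toy_judge,
    "What is the boiling point of water at sea level?",
    "Water boils at 100 degrees Celsius at sea level under standard pressure.",
    "Water boils at 90 degrees Celsius at sea level.",  # factual perturbation
))
```

Aggregating this check over many perturbed answers and categories (2,400 answers across 22 categories in the paper) yields the failure rates the study reports, such as evaluators missing the quality drop in more than half of the cases.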