The BIGGEN BENCH: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

9 Jun 2024 | Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
The BIGGEN BENCH is a principled benchmark for fine-grained evaluation of language models (LMs), designed to assess nine distinct capabilities across 77 diverse tasks: instruction following, grounding, planning, reasoning, refinement, safety, theory of mind, tool usage, and multilingualism. It uses instance-specific evaluation criteria that closely mirror human assessment, yielding more precise evaluations. The benchmark was used to assess 103 frontier LMs with five evaluator LMs, and all code, data, and evaluation results are publicly available.

Each instance includes evaluation criteria tailored to that specific input, enabling detailed analysis of performance. The evaluation protocol involves human-in-the-loop construction of instances, cross-validation, and human judgments to ensure reliability.

Results show that performance increases smoothly with model size for base LMs, with strong linear relationships. Chat LMs also improve with scale, though the correlation is lower than for base LMs. The performance gap between base and chat LMs narrows as model size increases, suggesting that post-training mainly enhances capabilities already present in base models. Open-source and proprietary LMs show significant performance differences, with open-source models lagging in certain capabilities.

Evaluator LMs correlate strongly with human judgments, with GPT-4-Turbo-2024-04-09 achieving the highest average Pearson correlation. Instance-specific evaluation criteria yield higher correlations with human judgments than coarse-grained or domain-specific criteria, and the evaluation pipeline is robust against verbosity bias, as shown by weak linear relationships between response length and assigned scores.

Overall, the BIGGEN BENCH provides a comprehensive, fine-grained evaluation of LMs, highlighting the importance of instance-specific criteria and the effectiveness of evaluator LMs in assessing model capabilities. The benchmark aims to advance the development of LMs by identifying areas for improvement and ensuring fair, accessible evaluations.
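To make the instance-specific evaluation concrete, below is a minimal sketch of how an evaluator LM can score a response against a per-instance rubric on a 1-5 scale. The prompt layout, the `query_evaluator` callable, and the `[RESULT]` parsing convention are illustrative assumptions, not the paper's exact prompt.

```python
# Minimal sketch of instance-specific rubric scoring with an evaluator LM.
# `query_evaluator` is a hypothetical stand-in for any chat-completion API call;
# the template and the "[RESULT] <score>" convention are assumptions for illustration.
import re

JUDGE_TEMPLATE = """You are an impartial evaluator.

### Instruction:
{instruction}

### Response to evaluate:
{response}

### Score rubric (specific to this instance):
{rubric}

Give brief feedback, then end with: "[RESULT] <integer 1-5>"."""


def score_response(instruction: str, response: str, rubric: str, query_evaluator) -> int:
    """Ask the evaluator LM for a 1-5 score under the instance-specific rubric."""
    judgement = query_evaluator(JUDGE_TEMPLATE.format(
        instruction=instruction, response=response, rubric=rubric))
    match = re.search(r"\[RESULT\]\s*([1-5])", judgement)
    if match is None:
        raise ValueError("Evaluator did not return a parsable score")
    return int(match.group(1))
```

In use, the same instruction and response would be paired with a rubric written for that particular instance (rather than a generic "helpfulness" rubric), which is what allows the fine-grained, criterion-level analysis described above.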
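The two reliability checks mentioned above, agreement with human judgments and robustness to verbosity bias, both reduce to Pearson correlations. The sketch below shows the computation on small, purely hypothetical data.

```python
# Sketch of the two correlation checks: (1) evaluator-human agreement and
# (2) verbosity bias (response length vs. assigned score). All data are hypothetical.
from scipy.stats import pearsonr

evaluator_scores = [4, 3, 5, 2, 4, 3, 5, 1]                 # hypothetical 1-5 ratings from the evaluator LM
human_scores = [4, 3, 4, 2, 5, 3, 5, 1]                     # hypothetical 1-5 ratings from human annotators
response_lengths = [220, 180, 410, 90, 350, 150, 500, 60]   # hypothetical response lengths (tokens)

agreement, _ = pearsonr(evaluator_scores, human_scores)
verbosity, _ = pearsonr(response_lengths, evaluator_scores)

print(f"Evaluator-human Pearson r: {agreement:.2f}")  # high r -> evaluator tracks human judgment
print(f"Length-score Pearson r:    {verbosity:.2f}")  # weak r -> little verbosity bias
```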