9 Jun 2024 | Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
The BIGGEN BENCH is a principled benchmark designed to evaluate the fine-grained capabilities of language models (LMs) across 77 diverse tasks. It introduces instance-specific evaluation criteria that mirror the nuance of human assessment, capturing subtle details and variability in responses. The benchmark covers nine core capabilities: instruction following, grounding, planning, reasoning, refinement, safety, theory of mind, tool usage, and multilingualism. Five evaluator LMs score the responses of 103 frontier LMs ranging from 1 billion to 141 billion parameters. Key findings include smooth, predictable performance improvements with model scaling, significant gaps in reasoning and tool usage between pre-trained and post-trained LMs, and notable performance differences between open-source and proprietary LMs. The study also shows that evaluator LMs can reliably mimic human judgment, with GPT-4-Turbo-2024-04-09 achieving the highest Pearson correlation with human scores. The paper highlights the importance of fine-grained evaluation criteria and the robustness of the evaluation pipeline against verbosity bias.
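To make the evaluation setup concrete, below is a minimal sketch of how instance-specific rubric scoring and the judge-reliability check might look in practice. The prompt template, the `query_judge_lm` callable, and the function names are hypothetical illustrations, not the benchmark's released prompts or code; only the general pattern (judge LM grades a response against a per-instance rubric, and judge scores are correlated with human scores via Pearson's r) comes from the summary above.

```python
# Sketch of instance-specific rubric scoring, BIGGEN BENCH style (hypothetical code).
import re
from scipy.stats import pearsonr

# Hypothetical judge prompt; the actual benchmark prompts differ.
JUDGE_PROMPT = """You are evaluating a model response.

### Instruction:
{instruction}

### Response:
{response}

### Score Rubric (specific to this instance):
{rubric}

Assign an integer score from 1 to 5 and end with "Score: <n>"."""


def score_response(instruction: str, response: str, rubric: str, query_judge_lm) -> int:
    """Ask an evaluator LM to grade one response against its instance-specific rubric.

    `query_judge_lm` is an assumed callable (e.g., wrapping an API client) that
    takes a prompt string and returns the judge's free-text feedback.
    """
    prompt = JUDGE_PROMPT.format(instruction=instruction, response=response, rubric=rubric)
    feedback = query_judge_lm(prompt)
    match = re.search(r"Score:\s*([1-5])", feedback)
    return int(match.group(1)) if match else 1  # fall back to the lowest score if parsing fails


def agreement_with_humans(lm_scores, human_scores) -> float:
    """Pearson correlation between evaluator-LM scores and human scores,
    the statistic the paper reports when comparing judges such as GPT-4-Turbo."""
    r, _ = pearsonr(lm_scores, human_scores)
    return r
```

In this framing, each benchmark instance carries its own rubric rather than a single generic scoring guideline, which is what lets the judge capture instance-level nuances; the Pearson correlation then quantifies how closely a given evaluator LM tracks human ratings.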