INFOBENCH: Evaluating Instruction Following Ability in Large Language Models


7 Jan 2024 | Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu
This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating how well Large Language Models (LLMs) follow instructions. DRFR breaks complex instructions down into simpler criteria, enabling a detailed analysis of a model's compliance with each aspect of a task. Alongside the metric, the authors present INFOBENCH, a benchmark of 500 diverse instructions and 2,250 decomposed questions spanning multiple constraint categories.

The INFOBENCH dataset comprises two sets: an Easy Set designed for a broad range of applications, and a Hard Set that is manually curated and inspired by various subject areas. The instructions cover a wide range of constraint types, including content, linguistic, style, format, and number constraints.

The authors conducted two key experiments: one comparing DRFR with traditional Direct Scoring (DS) for evaluating responses from various LLMs, and another exploring more cost-efficient annotation sources, including human experts, crowdsourced workers, and GPT-4. The results show that DRFR yields higher consensus among annotators, particularly on the Hard Set, and that GPT-4 is a highly accurate, cost-effective, and time-efficient alternative for annotation.

Using this framework, the study evaluates six advanced LLMs, revealing their strengths and the areas that need improvement, particularly in complex instruction-following. While progress has been significant, a notable gap remains in the models' ability to follow instructions perfectly in more complex scenarios. Closed-source models currently lead, possibly due to better data or more sophisticated algorithms. Performance differences across constraint types and domains suggest that the challenges of instruction-following are nuanced and may require focused improvements in areas such as numerical and linguistic understanding.

The study presents four key contributions: a novel metric (DRFR), a comprehensive benchmark (INFOBENCH), a demonstration of the efficacy of both, and a thorough analysis of six advanced LLMs. Together, these contributions offer insights for future LLM development and evaluation.
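To make the metric concrete, the sketch below shows one way DRFR could be computed from per-criterion yes/no judgments, pooling all decomposed questions across instructions. The data structures and field names are illustrative assumptions, not the authors' released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluatedInstruction:
    """One instruction with yes/no verdicts for its decomposed criteria.

    verdicts[i] is True when the model's response satisfies the i-th
    decomposed question (field names are assumptions for this sketch).
    """
    instruction: str
    verdicts: List[bool]

def drfr(evaluations: List[EvaluatedInstruction]) -> float:
    """Decomposed Requirements Following Ratio: the fraction of all
    decomposed criteria (pooled across instructions) that are satisfied."""
    total = sum(len(e.verdicts) for e in evaluations)
    satisfied = sum(sum(e.verdicts) for e in evaluations)
    return satisfied / total if total else 0.0

# Toy example: two instructions with 3 and 2 decomposed questions.
evals = [
    EvaluatedInstruction("Write a 3-line poem about rain in French.",
                         [True, True, False]),  # length OK, topic OK, wrong language
    EvaluatedInstruction("Summarize the article in exactly two sentences.",
                         [True, False]),         # is a summary, but three sentences
]
print(f"DRFR = {drfr(evals):.2f}")  # 3 of 5 criteria met -> 0.60
```

Because each decomposed question is scored independently, partial compliance is visible in the ratio rather than collapsed into a single pass/fail judgment, which is what gives the metric its finer granularity than direct scoring.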
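The GPT-4-as-annotator setup could look roughly like the following, where each decomposed question is posed as a yes/no query about a model's response. The prompt wording and the use of the OpenAI chat API here are assumptions for illustration, not the paper's exact evaluation prompt.

```python
# Sketch of using GPT-4 to answer one decomposed evaluation question with YES/NO.
# The prompt text and model name are illustrative assumptions; the paper's
# annotation prompt may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_criterion(instruction: str, response: str, question: str) -> bool:
    prompt = (
        "You are evaluating whether a response follows an instruction.\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Question: {question}\n"
        "Answer with YES or NO only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = completion.choices[0].message.content.strip().upper()
    return answer.startswith("YES")
```

Running one such query per decomposed question and feeding the boolean verdicts into the DRFR computation above is the kind of pipeline the paper's annotation-source comparison evaluates against expert and crowdsourced labels.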