INFOBENCH: Evaluating Instruction Following Ability in Large Language Models

7 Jan 2024 | Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu
This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. DRFR breaks down complex instructions into simpler criteria, allowing for a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside DRFR, the paper presents INFOBENCH, a benchmark dataset comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. The authors compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowdsourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
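To make the metric concrete, here is a minimal sketch of how a DRFR-style score could be computed, assuming each instruction has already been decomposed into yes/no criteria and each criterion has been judged (by a human annotator or GPT-4) as satisfied or not. The `judgements` structure and function name are illustrative assumptions, not taken from the paper's released code.

```python
def drfr(judgements: list[list[bool]]) -> float:
    """DRFR-style score: fraction of decomposed criteria satisfied across all responses.

    judgements[i][j] is True if response i satisfies decomposed criterion j.
    (Illustrative sketch; the paper's official implementation may differ.)
    """
    total = sum(len(criteria) for criteria in judgements)
    satisfied = sum(sum(criteria) for criteria in judgements)
    return satisfied / total if total else 0.0


# Example: one response meets 3/3 criteria, another meets 1/2 -> 4/5 = 0.8
print(drfr([[True, True, True], [True, False]]))
```

Scoring each decomposed criterion independently, rather than assigning a single holistic rating per response, is what allows the fine-grained analysis of which constraint types a model fails to satisfy.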