Benchmarking Complex Instruction-Following with Multiple Constraints Composition

11 Jul 2024 | Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang
This paper introduces ComplexBench, a benchmark for evaluating the ability of large language models (LLMs) to follow complex instructions composed of multiple constraints. The benchmark is built on a hierarchical taxonomy of complex instructions comprising four constraint types (Lexical, Format, Semantic, and Utility), 19 constraint dimensions, and four composition types (Single, And, Chain, and Selection). A high-quality dataset is manually collected to cover all constraint and composition types in the taxonomy.

To evaluate generated texts, the benchmark proposes a rule-augmented LLM-based evaluation method: evaluation segments are extracted from generated responses, each scoring question is answered by an LLM or a deterministic rule, and the answers are aggregated according to the dependency structure determined by the composition types. The evaluation protocol is validated by comparing its judgments against human evaluations, and the performance of various LLMs is analyzed across constraint and composition types.

ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions built from multiple composed constraints. The results show that LLMs generally perform better on Semantic and Utility constraints but struggle with Format and Lexical constraints, which have explicit evaluation standards. Among composition types, Chain presents the most severe challenge, with Selection second, and the results highlight the weakness of LLMs in following multi-layer tree-structured instructions.
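To make the rule-augmented evaluation more concrete, below is a minimal Python sketch of the dependency-aware aggregation step. This is not the authors' implementation: the names (ScoringQuestion, aggregate), the toy verifiers, and the failure-propagation rule are illustrative assumptions about how answers from rules or LLM judges might be combined along the dependency structure induced by And, Chain, and Selection compositions.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ScoringQuestion:
    """One yes/no question derived from a single constraint.

    `verifier` is either a deterministic rule (e.g. a format or word-count
    check for Lexical/Format constraints) or a call to an LLM judge for
    Semantic/Utility constraints; both return True/False for a response.
    `depends_on` lists prerequisite questions, as induced by Chain/Selection
    compositions in the instruction.
    """
    qid: str
    verifier: Callable[[str], bool]
    depends_on: list[str] = field(default_factory=list)

def aggregate(questions: list[ScoringQuestion], response: str) -> dict[str, bool]:
    """Answer each question, then propagate failures along the dependency
    structure: a question is credited only if all of its prerequisites passed."""
    raw = {q.qid: q.verifier(response) for q in questions}
    final: dict[str, bool] = {}
    for q in questions:  # assumes questions are listed in dependency order
        prereqs_ok = all(final.get(d, False) for d in q.depends_on)
        final[q.qid] = raw[q.qid] and prereqs_ok
    return final

# Hypothetical usage: a Chain composition where a lexical check only counts
# if the upstream format constraint was satisfied first.
qs = [
    ScoringQuestion("q1_starts_with_bullet", lambda r: r.lstrip().startswith("-")),
    ScoringQuestion("q2_mentions_llm", lambda r: "LLM" in r,
                    depends_on=["q1_starts_with_bullet"]),
]
print(aggregate(qs, "- LLMs struggle with Lexical constraints."))
# {'q1_starts_with_bullet': True, 'q2_mentions_llm': True}
```

An instruction-level score can then be derived from the per-question answers (for example, the fraction of questions passed), which is one plausible way to read the paper's aggregation description.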
The paper concludes that ComplexBench can serve as a valuable tool for benchmarking the complex instruction-following ability of LLMs and can provide useful insights for future work on improving that ability.