Benchmarking Complex Instruction-Following with Multiple Constraints Composition

11 Jul 2024 | Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang
This paper introduces ComplexBench, a benchmark for evaluating the ability of large language models (LLMs) to follow complex instructions composed of multiple constraints. The benchmark is built on a hierarchical taxonomy of complex instructions comprising four constraint types (Lexical, Format, Semantic, and Utility), 19 constraint dimensions, and four composition types (Single, And, Chain, and Selection). A high-quality dataset is manually collected to cover all constraint and composition types in the taxonomy.

To evaluate generated texts, the benchmark proposes a rule-augmented LLM-based evaluation method: evaluation segments are extracted from generated responses, each scoring question is answered by an LLM or a deterministic rule, and the answers are aggregated according to the dependency structure determined by the composition types. The evaluation protocol is validated by comparing its judgments against human evaluations, and the performance of various LLMs is analyzed across constraint and composition types.

ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions built from multiple composed constraints. The results show that LLMs generally perform better on Semantic and Utility constraints but struggle with Format and Lexical constraints, which have explicit evaluation standards. Among composition types, Chain presents the most severe challenge, with Selection second, and the results highlight the weakness of LLMs in following multi-layer tree-structured instructions.
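To make the rule-augmented evaluation more concrete, below is a minimal Python sketch of the dependency-aware aggregation step. This is not the authors' implementation: the names (ScoringQuestion, aggregate), the toy verifiers, and the failure-propagation rule are illustrative assumptions about how answers from rules or LLM judges might be combined along the dependency structure induced by And, Chain, and Selection compositions.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ScoringQuestion:
    """One yes/no question derived from a single constraint.

    `verifier` is either a deterministic rule (e.g. a format or word-count
    check for Lexical/Format constraints) or a call to an LLM judge for
    Semantic/Utility constraints; both return True/False for a response.
    `depends_on` lists prerequisite questions, as induced by Chain/Selection
    compositions in the instruction.
    """
    qid: str
    verifier: Callable[[str], bool]
    depends_on: list[str] = field(default_factory=list)

def aggregate(questions: list[ScoringQuestion], response: str) -> dict[str, bool]:
    """Answer each question, then propagate failures along the dependency
    structure: a question is credited only if all of its prerequisites passed."""
    raw = {q.qid: q.verifier(response) for q in questions}
    final: dict[str, bool] = {}
    for q in questions:  # assumes questions are listed in dependency order
        prereqs_ok = all(final.get(d, False) for d in q.depends_on)
        final[q.qid] = raw[q.qid] and prereqs_ok
    return final

# Hypothetical usage: a Chain composition where a lexical check only counts
# if the upstream format constraint was satisfied first.
qs = [
    ScoringQuestion("q1_starts_with_bullet", lambda r: r.lstrip().startswith("-")),
    ScoringQuestion("q2_mentions_llm", lambda r: "LLM" in r,
                    depends_on=["q1_starts_with_bullet"]),
]
print(aggregate(qs, "- LLMs struggle with Lexical constraints."))
# {'q1_starts_with_bullet': True, 'q2_mentions_llm': True}
```

An instruction-level score can then be derived from the per-question answers (for example, the fraction of questions passed), which is one plausible way to read the paper's aggregation description.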
The paper concludes that ComplexBench can serve as a valuable tool for benchmarking the complex instruction-following ability of LLMs and can provide useful insights for future work on improving that ability.