This paper introduces a benchmark self-evolving framework for dynamically evaluating Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. The framework uses a multi-agent system to manipulate the context or question of original benchmark instances, generating new evolving instances with high confidence that dynamically extend existing benchmarks. Six reframing operations construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. The framework extends benchmark datasets for four tasks: mathematical reasoning (GSM8K), logical reasoning (CLUTRR), commonsense reasoning (StrategyQA), and reading comprehension (BoolQ). Experimental results show a general performance decline for most LLMs relative to their original results, indicating that the framework provides a more accurate reflection of models' capabilities. The framework also widens performance discrepancies between different models and within the same model across tasks, facilitating more informed model selection for specific tasks.
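To make these reframing directions concrete, the following small example shows how a single GSM8K-style instance might be evolved. The instance and the operation names are hypothetical, chosen only for exposition; they are not the paper's exact six operations.

```python
# Hypothetical illustration of three evolving directions named in the summary:
# diverse queries, data noise, and sub-ability probing. Not taken from the paper.
original = {
    "context": "Ann buys 3 apples at $2 each.",
    "question": "How much does Ann spend?",
    "answer": "6",
}

evolving = {
    # diverse query: ask about the same scenario from a different angle
    "alternative_question": {
        "context": original["context"],
        "question": "If Ann pays with a $10 bill, how much change does she get?",
        "answer": "4",
    },
    # data noise: insert an irrelevant sentence into the context
    "noisy_context": {
        "context": original["context"] + " Her friend Bob prefers oranges.",
        "question": original["question"],
        "answer": "6",
    },
    # sub-ability probing: query an intermediate reasoning step
    "sub_question": {
        "context": original["context"],
        "question": "What is the price of a single apple?",
        "answer": "2",
    },
}
```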
The framework's multi-agent system has four key components: an instance pre-filter, an instance creator, an instance verifier, and a candidate option formulator. The pre-filter screens out original instances that the system cannot handle reliably, the creator generates new instances by modifying their contexts or questions, the verifier checks the correctness of each new instance, and the candidate option formulator produces incorrect answer options for each new context-question pair. The system is powered by GPT-4 to leverage its generative and verification strengths.
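A minimal sketch of how these four agents could be chained is shown below, assuming a generic GPT-4 chat-completion call. All prompts, helper names, and the verification heuristic are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the four-agent evolving pipeline (pre-filter -> creator -> verifier
# -> option formulator). Prompts and heuristics are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    context: str
    question: str
    answer: str
    options: list = field(default_factory=list)


def call_gpt4(prompt: str) -> str:
    """Placeholder for a GPT-4 API call; plug in a real client here."""
    raise NotImplementedError


def pre_filter(inst: Instance) -> bool:
    """Instance pre-filter: keep only instances the agent itself solves correctly."""
    solution = call_gpt4(f"{inst.context}\n{inst.question}\nAnswer concisely.")
    return inst.answer.lower() in solution.lower()


def create(inst: Instance, operation: str) -> Instance:
    """Instance creator: apply one reframing operation to the context or question."""
    reply = call_gpt4(
        f"Apply the '{operation}' operation to the instance below and return\n"
        f"NEW_CONTEXT ||| NEW_QUESTION ||| NEW_ANSWER\n\n"
        f"{inst.context}\n{inst.question}\nOriginal answer: {inst.answer}"
    )
    context, question, answer = (part.strip() for part in reply.split("|||"))
    return Instance(context, question, answer)


def verify(inst: Instance) -> bool:
    """Instance verifier: re-solve the evolved instance and check answer consistency."""
    solution = call_gpt4(f"{inst.context}\n{inst.question}\nAnswer concisely.")
    return inst.answer.lower() in solution.lower()


def formulate_options(inst: Instance, n: int = 3) -> Instance:
    """Candidate option formulator: add plausible but incorrect distractors."""
    reply = call_gpt4(
        f"Give {n} incorrect but plausible answers, one per line, for:\n"
        f"{inst.context}\n{inst.question}"
    )
    inst.options = [inst.answer] + reply.splitlines()[:n]
    return inst


def evolve(inst: Instance, operation: str) -> Optional[Instance]:
    """Chain the agents; return None if the evolved instance fails any stage."""
    if not pre_filter(inst):
        return None
    new_inst = create(inst, operation)
    if not verify(new_inst):
        return None
    return formulate_options(new_inst)
```

In practice each agent would rely on carefully engineered prompts, and the verifier might run several rounds of checking before an evolving instance is accepted.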
The framework's scalable, robust, and fine-grained evaluations reveal that most models perform worse than on the original benchmarks, highlighting their limited robustness and generalizability. The evaluations also expose selection bias in certain LLMs, which tend to favor option 'A' in multiple-choice questions. After debiasing, GPT-4 consistently performs best across all sub-abilities, while Mistral shows the lowest performance. By dynamically updating instances, the framework also mitigates data contamination, narrowing the performance gap between contaminated models and their original counterparts. Human verification confirms the high accuracy of the evolving instances. Overall, the framework provides a more accurate and comprehensive evaluation of LLMs, helping practitioners select the most suitable models for specific applications.
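The selection-bias finding implies a debiasing step over option positions. One common approach, sketched below as an assumption rather than the paper's exact procedure, is to rotate the answer options so each one appears in every position and then aggregate the model's choices.

```python
# A minimal sketch of position-debiasing by cycling option order and taking a
# majority vote; an illustrative assumption, not necessarily the paper's method.
from collections import Counter


def ask_model(context: str, question: str, options: list) -> int:
    """Placeholder: return the index of the option the model picks."""
    raise NotImplementedError


def debiased_answer(context: str, question: str, options: list) -> str:
    """Rotate the options so each appears once in every position,
    then return the majority-voted pick."""
    votes = Counter()
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        picked = rotated[ask_model(context, question, rotated)]
        votes[picked] += 1
    return votes.most_common(1)[0][0]
```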