This paper introduces a benchmark self-evolving framework for dynamically evaluating Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. The framework uses a multi-agent system to manipulate the context or question of original benchmark instances, generating new evolving instances with high confidence that dynamically extend existing benchmarks. Six reframing operations construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. The framework extends benchmark datasets for four tasks: mathematical reasoning (GSM8K), logical reasoning (CLUTRR), commonsense reasoning (StrategyQA), and reading comprehension (BoolQ). Experimental results show a general performance decline for most LLMs relative to their original results, indicating that the framework provides a more accurate reflection of models' capabilities. The framework also widens performance discrepancies between different models and within the same model across tasks, facilitating more informed model selection for specific tasks.
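To make these reframing directions concrete, the following small example shows how a single GSM8K-style instance might be evolved. The instance and the operation names are hypothetical, chosen only for exposition; they are not the paper's exact six operations.

```python
# Hypothetical illustration of three evolving directions named in the summary:
# diverse queries, data noise, and sub-ability probing. Not taken from the paper.
original = {
    "context": "Ann buys 3 apples at $2 each.",
    "question": "How much does Ann spend?",
    "answer": "6",
}

evolving = {
    # diverse query: ask about the same scenario from a different angle
    "alternative_question": {
        "context": original["context"],
        "question": "If Ann pays with a $10 bill, how much change does she get?",
        "answer": "4",
    },
    # data noise: insert an irrelevant sentence into the context
    "noisy_context": {
        "context": original["context"] + " Her friend Bob prefers oranges.",
        "question": original["question"],
        "answer": "6",
    },
    # sub-ability probing: query an intermediate reasoning step
    "sub_question": {
        "context": original["context"],
        "question": "What is the price of a single apple?",
        "answer": "2",
    },
}
```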
The framework's multi-agent system has four key components: an instance pre-filter, an instance creator, an instance verifier, and a candidate option formulator. The pre-filter screens out original instances that the system cannot handle reliably, the creator generates new instances by modifying their contexts or questions, the verifier checks the correctness of each new instance, and the candidate option formulator produces incorrect answer options for each new context-question pair. The system is powered by GPT-4 to leverage its generative and verification strengths.
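A minimal sketch of how these four agents could be chained is shown below, assuming a generic GPT-4 chat-completion call. All prompts, helper names, and the verification heuristic are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the four-agent evolving pipeline (pre-filter -> creator -> verifier
# -> option formulator). Prompts and heuristics are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    context: str
    question: str
    answer: str
    options: list = field(default_factory=list)


def call_gpt4(prompt: str) -> str:
    """Placeholder for a GPT-4 API call; plug in a real client here."""
    raise NotImplementedError


def pre_filter(inst: Instance) -> bool:
    """Instance pre-filter: keep only instances the agent itself solves correctly."""
    solution = call_gpt4(f"{inst.context}\n{inst.question}\nAnswer concisely.")
    return inst.answer.lower() in solution.lower()


def create(inst: Instance, operation: str) -> Instance:
    """Instance creator: apply one reframing operation to the context or question."""
    reply = call_gpt4(
        f"Apply the '{operation}' operation to the instance below and return\n"
        f"NEW_CONTEXT ||| NEW_QUESTION ||| NEW_ANSWER\n\n"
        f"{inst.context}\n{inst.question}\nOriginal answer: {inst.answer}"
    )
    context, question, answer = (part.strip() for part in reply.split("|||"))
    return Instance(context, question, answer)


def verify(inst: Instance) -> bool:
    """Instance verifier: re-solve the evolved instance and check answer consistency."""
    solution = call_gpt4(f"{inst.context}\n{inst.question}\nAnswer concisely.")
    return inst.answer.lower() in solution.lower()


def formulate_options(inst: Instance, n: int = 3) -> Instance:
    """Candidate option formulator: add plausible but incorrect distractors."""
    reply = call_gpt4(
        f"Give {n} incorrect but plausible answers, one per line, for:\n"
        f"{inst.context}\n{inst.question}"
    )
    inst.options = [inst.answer] + reply.splitlines()[:n]
    return inst


def evolve(inst: Instance, operation: str) -> Optional[Instance]:
    """Chain the agents; return None if the evolved instance fails any stage."""
    if not pre_filter(inst):
        return None
    new_inst = create(inst, operation)
    if not verify(new_inst):
        return None
    return formulate_options(new_inst)
```

In practice each agent would rely on carefully engineered prompts, and the verifier might run several rounds of checking before an evolving instance is accepted.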
The framework's scalable, robust, and fine-grained evaluations reveal that most models perform worse than on the original benchmarks, highlighting their limited robustness and generalizability. The evaluations also expose selection bias in certain LLMs, which tend to favor option 'A' in multiple-choice questions. After debiasing, GPT-4 consistently performs best across all sub-abilities, while Mistral shows the lowest performance. By dynamically updating instances, the framework also mitigates data contamination, narrowing the performance gap between contaminated models and their original counterparts. Human verification confirms the high accuracy of the evolving instances. Overall, the framework provides a more accurate and comprehensive evaluation of LLMs, helping practitioners select the most suitable models for specific applications.
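The selection-bias finding implies a debiasing step over option positions. One common approach, sketched below as an assumption rather than the paper's exact procedure, is to rotate the answer options so each one appears in every position and then aggregate the model's choices.

```python
# A minimal sketch of position-debiasing by cycling option order and taking a
# majority vote; an illustrative assumption, not necessarily the paper's method.
from collections import Counter


def ask_model(context: str, question: str, options: list) -> int:
    """Placeholder: return the index of the option the model picks."""
    raise NotImplementedError


def debiased_answer(context: str, question: str, options: list) -> str:
    """Rotate the options so each appears once in every position,
    then return the majority-voted pick."""
    votes = Counter()
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        picked = rotated[ask_model(context, question, rotated)]
        votes[picked] += 1
    return votes.most_common(1)[0][0]
```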