**KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models**
**Abstract:**
Automatic evaluation methods for large language models (LLMs) often suffer from data contamination, leading to inflated assessments of their effectiveness. Existing strategies focus on detecting contaminated texts but fail to accurately gauge model performance. This paper introduces KIEval, a Knowledge-grounded Interactive Evaluation framework that incorporates an LLM-powered "interactor" to achieve dynamic, contamination-resilient evaluation. KIEval uses multi-round, knowledge-focused dialogues to determine whether a model's response merely recalls benchmark answers or reflects genuine comprehension and the ability to apply knowledge in complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. The study reveals that data contamination brings no benefit, and can even be detrimental, to models' real-world applicability and understanding, and that current contamination detection methods for LLMs can only identify contamination introduced during pre-training, not during supervised fine-tuning.
**Introduction:**
The landscape of artificial intelligence has been significantly reshaped by the emergence of LLMs, which are pivotal in various natural language understanding and generation tasks. However, automatic evaluation methods for LLMs face challenges from data contamination, leading to overestimation of their real-world efficacy. KIEval addresses this issue by introducing a novel "interactor" role, powered by an LLM, to evaluate models through dynamic, multi-round dialogues. This approach ensures that models are assessed on their ability to apply knowledge in complex conversations rather than on simply recalling benchmark answers. KIEval is designed to be contamination-resilient and generalizable, leveraging high-quality benchmark datasets as knowledge sources.
**Methodology:**
KIEval involves iterative interactions in which an interactor model generates questions grounded in existing benchmarks, challenging the candidate model with context-rich scenarios. An evaluator model then assesses the candidate's responses for accuracy, relevance, and coherence. The framework emphasizes reproducibility: the dialogue context is kept consistent across evaluations, LLM outputs are decoded deterministically, and an early-stopping mechanism ends dialogues that are no longer meaningful. A sketch of this interaction loop follows.
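The Python sketch below illustrates how such an interactor–candidate–evaluator loop might be wired together; it is a minimal illustration under stated assumptions, not the paper's actual implementation. The `interactor`, `candidate`, and `evaluator` objects, their `chat`/`judge` methods, and the stopping threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    question: str
    answer: str
    score: float   # evaluator score covering accuracy, relevance, coherence
    comment: str   # evaluator rationale for the score

def kieval_dialogue(benchmark_item, interactor, candidate, evaluator,
                    max_rounds: int = 5, stop_threshold: float = 0.3):
    """Run one knowledge-grounded, multi-round evaluation dialogue.

    Assumes hypothetical objects exposing chat(prompt, history) -> str
    and judge(question, answer, history) -> (float, str) interfaces.
    """
    history: list[DialogueTurn] = []
    for _ in range(max_rounds):
        # 1. The interactor asks a question grounded in the benchmark item,
        #    conditioning on the dialogue so far to probe deeper each round.
        question = interactor.chat(
            f"Benchmark item: {benchmark_item}\n"
            "Ask a follow-up question that tests deeper understanding.",
            history=history,
        )
        # 2. The candidate model answers within the evolving dialogue context.
        answer = candidate.chat(question, history=history)
        # 3. The evaluator scores the answer; decoding is deterministic
        #    (e.g. temperature 0) so repeated runs stay consistent.
        score, comment = evaluator.judge(question, answer, history=history)
        history.append(DialogueTurn(question, answer, score, comment))
        # 4. Early stopping: end the conversation once responses degrade,
        #    so only meaningful exchanges contribute to the final score.
        if score < stop_threshold:
            break
    # Aggregate per-turn scores into a single dialogue-level score.
    return sum(t.score for t in history) / len(history), history
```

In this sketch, recall of a memorized benchmark answer would only help in the first turn; sustaining high scores requires the candidate to keep applying the underlying knowledge as the interactor's follow-up questions deepen.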
**Experiments:**
Experiments evaluate popular LLMs on five benchmark datasets: ARC-Easy, ARC-Challenge, HellaSwag, MMLU, and C-Eval. The results show that GPT-3.5 consistently performs well, while LLaMA2 70B is competitive on static benchmarks but shows a larger gap under KIEval metrics, suggesting that traditional benchmarks may understate performance differences between models. KIEval also demonstrates resilience to data contamination, outperforming static dataset-based and LLM-based evaluation methods at revealing contaminated models.
**Conclusion:**
KIEval provides a dynamic and contamination-resilient evaluation framework for LLMs, improving evaluation reliability and alignment with human preferences. It reveals that training models on test sets primarily improves recall rather than genuine knowledge comprehension, underscoring the impact of data contamination.