KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

3 Jun 2024 | Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang
KIEval is a knowledge-grounded interactive evaluation framework for large language models (LLMs) that addresses the problem of data contamination in automatic evaluation. Traditional methods often focus on quantifying contamination rather than accurately assessing model performance. KIEval introduces an LLM-powered "interactor" that dynamically evaluates models through multi-round, knowledge-focused dialogues, distinguishing simple recall of benchmark answers from deep comprehension. The framework starts from existing benchmark datasets to pose questions requiring domain-specific knowledge and evaluates responses through structured, interactive conversations, which yields a more accurate assessment of model performance and greater resilience to data contamination.

Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. KIEval achieves a Pearson correlation coefficient of 0.81 with human scores, indicating that it closely reflects human preferences. The results also reveal that data contamination does not contribute positively to models' real-world applicability and understanding, and that existing contamination detection methods can identify contamination introduced during pre-training but not during fine-tuning. The core contributions are a novel dynamic evaluation protocol, an extensive evaluation of popular LLMs, and new insights into data contamination.

The framework is designed to be generalizable and scalable, leveraging advanced LLMs as interactors to assess performance across diverse domains and tasks without significant resource expenditure. To keep evaluations reproducible and consistent, KIEval uses separate models for the interactor and evaluator roles and maintains consistent dialogue contexts across evaluations. The evaluation procedure is a series of iterative interactions: the interactor generates question prompts, the candidate model responds, and the evaluator assesses each response. The scoring system grades candidate LLMs on several aspects, with scores ranging from 1 to 4; the overall KIEval score is a weighted sum of per-round scores across multiple rounds of interaction, with weights that emphasize the early turns of the conversation.

Experiments show that KIEval is resilient to data contamination: "cheater" models trained on test sets perform slightly worse than normal models, indicating that training on test data does not confer generalizable domain knowledge but instead encourages memorization. The results also suggest that traditional benchmarks may underestimate performance differences between LLMs, because they focus on testing understanding rather than generative ability. KIEval's alignment with human judgment is validated through human evaluation, which shows substantial agreement among annotators and a robust correlation with human preferences, and its cost-effectiveness and scalability are further supported by an analysis of computational resources and API usage.
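To make the iterative procedure described above concrete, the following Python sketch shows one way such a loop could be structured. It is not the authors' implementation: the `interactor`, `candidate`, and `evaluator` callables are hypothetical placeholders for the underlying LLM calls, and the round count is an assumption.

```python
# Minimal sketch of a KIEval-style interactive evaluation loop.
# `interactor`, `candidate`, and `evaluator` are assumed callables wrapping LLM
# APIs; they are illustrative placeholders, not the published KIEval code.

from typing import Callable, Dict, List

def kieval_dialogue(
    seed_question: str,                          # drawn from an existing benchmark dataset
    interactor: Callable[[List[Dict]], str],     # generates knowledge-focused follow-up questions
    candidate: Callable[[List[Dict]], str],      # the model under evaluation
    evaluator: Callable[[List[Dict]], int],      # grades the latest response on a 1-4 scale
    rounds: int = 5,                             # assumed number of dialogue rounds
) -> List[int]:
    """Run a multi-round, knowledge-grounded dialogue and score each turn."""
    history: List[Dict] = [{"role": "interactor", "content": seed_question}]
    scores: List[int] = []
    for _ in range(rounds):
        answer = candidate(history)                        # candidate responds to the current question
        history.append({"role": "candidate", "content": answer})
        scores.append(evaluator(history))                  # per-round score from 1 to 4
        follow_up = interactor(history)                    # probe deeper into the same topic
        history.append({"role": "interactor", "content": follow_up})
    return scores
```

Keeping the interactor and evaluator as separate models, and fixing the seed questions and dialogue contexts across candidates, is what allows scores to be compared consistently between models.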
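The per-round scores are then aggregated into a single KIEval score as a weighted sum that emphasizes early turns. The exact weighting scheme is defined in the paper; the sketch below simply assumes a geometric decay for illustration.

```python
def kieval_score(scores: list[int], decay: float = 0.8) -> float:
    """Aggregate per-round scores (1-4) into a single KIEval score.

    Early rounds are weighted more heavily; a geometric decay is assumed here
    for illustration -- the paper's actual weights may differ.
    """
    weights = [decay ** i for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Example: a candidate that answers the seed question well but degrades
# under follow-up probing scores above its plain per-round average.
print(kieval_score([4, 4, 3, 2, 2]))  # ~3.26 versus a plain average of 3.0
```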
Overall, KIEval provides dynamic evaluation and analysis of LLMs across diverse domains, assessing generative ability and domain knowledge through structured conversations while reducing the risk of data contamination and improving the reliability of evaluations.