Dynamic Evaluation of Large Language Models by Meta Probing Agents


2024 | Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
This paper introduces Meta Probing Agents (MPA), a dynamic evaluation protocol inspired by psychometrics for assessing large language models (LLMs). MPA is designed to mitigate data contamination and to provide a multifaceted analysis of three basic abilities: language understanding, problem solving, and domain knowledge. Guided by psychometric principles, the protocol uses agents to dynamically generate new evaluation questions, enabling a flexible and comprehensive assessment of LLMs.

MPA consists of two kinds of agents: probing agents, which transform existing benchmark questions into new ones, and judge agents, which check that each new question remains consistent with the original. This design supports both dynamic generation of evaluation samples and multifaceted ability analysis: the probing agents produce questions that target different cognitive abilities, while the judge agents validate the generated questions for accuracy and consistency.

The authors evaluated MPA on several benchmarks, including MMLU, ARC-C, GSM8K, and BBH. Most LLMs performed markedly worse on the dynamically generated benchmarks, indicating potential data contamination in the original ones. The analysis also revealed strong correlations among the three basic abilities, with language understanding and problem solving showing the strongest correlation.
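To make the probe-and-judge workflow concrete, the sketch below shows one way such a loop could be wired up. It is a minimal illustration under stated assumptions, not the paper's implementation: the function `llm_call` and the two prompt templates are hypothetical placeholders for whatever LLM client and prompts an evaluator actually uses.

```python
# Minimal sketch of a probe-and-judge loop for dynamic question generation.
# All names (llm_call, PROBE_PROMPT, JUDGE_PROMPT) are hypothetical; the
# paper's actual agents and prompts are not reproduced here.

def llm_call(prompt: str) -> str:
    """Placeholder for a call to any chat-completion API."""
    raise NotImplementedError("wire up your LLM client here")

PROBE_PROMPT = (
    "Rewrite the following question so that it tests the same knowledge "
    "with paraphrased wording and permuted answer options.\n\n{question}"
)
JUDGE_PROMPT = (
    "Original question:\n{original}\n\nRewritten question:\n{rewritten}\n\n"
    "Answer 'yes' if the rewritten question keeps the same correct answer "
    "and difficulty, otherwise answer 'no'."
)

def generate_probe(original_question: str, max_retries: int = 3) -> str | None:
    """Turn one benchmark item into a new evaluation sample.

    The probing agent proposes a rewrite; the judge agent checks that the
    rewrite stays consistent with the original. Rejected rewrites are retried.
    """
    for _ in range(max_retries):
        rewritten = llm_call(PROBE_PROMPT.format(question=original_question))
        verdict = llm_call(JUDGE_PROMPT.format(original=original_question,
                                               rewritten=rewritten))
        if verdict.strip().lower().startswith("yes"):
            return rewritten
    return None  # drop the item if the judge never accepts a rewrite
```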
Additionally, larger models exhibited stronger correlations between abilities, suggesting an implicit "Matthew effect." MPA can also serve as a data augmentation approach: MPA-generated data improved LLM performance by an average of about 2% on MMLU and ARC-C. An error analysis further identified recurring failure patterns related to question understanding, problem solving, and domain knowledge. The paper concludes that MPA is a promising, dynamic, and flexible framework for evaluating LLM abilities. The findings point to the need for further research on improving LLMs and on understanding their capabilities, and the authors emphasize that rigorous evaluation is essential for responsible AI development and for gauging models' true capabilities.
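The ability-correlation analysis summarized above amounts to computing pairwise Pearson correlations over per-model ability scores. The snippet below is a generic sketch of that computation; the ability names and score values are assumptions supplied by the caller, not numbers from the paper.

```python
# Pairwise Pearson correlations between ability scores, one score per model.
# Generic sketch of the analysis described above; the input data are
# placeholders provided by the caller, not results from the paper.
import numpy as np

def ability_correlations(scores: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Return Pearson r for every pair of abilities.

    `scores` maps an ability name (e.g. "language_understanding",
    "problem_solving", "domain_knowledge") to a list of scores, one per
    evaluated model, with all lists aligned in the same model order.
    """
    names = list(scores)
    correlations = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = np.corrcoef(scores[a], scores[b])[0, 1]
            correlations[(a, b)] = float(r)
    return correlations
```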