Dynamic Evaluation of Large Language Models by Meta Probing Agents


2024 | Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
This paper introduces Meta Probing Agents (MPA), a dynamic evaluation protocol inspired by psychometrics for assessing large language models (LLMs). MPA is designed to mitigate data contamination and to provide a multifaceted analysis of three basic abilities: language understanding, problem solving, and domain knowledge. Guided by psychometric principles, the protocol uses agents to dynamically generate new evaluation questions, enabling a flexible and comprehensive assessment of LLMs.

MPA consists of two kinds of agents: probing agents, which transform existing benchmark questions into new ones, and judge agents, which check that each new question remains consistent with the original. This design supports both dynamic generation of evaluation samples and multifaceted ability analysis: the probing agents produce questions that target different cognitive abilities, while the judge agents validate the generated questions for accuracy and consistency.

The authors evaluated MPA on several benchmarks, including MMLU, ARC-C, GSM8K, and BBH. Most LLMs performed markedly worse on the dynamically generated benchmarks, indicating potential data contamination in the original ones. The analysis also revealed strong correlations among the three basic abilities, with language understanding and problem solving showing the strongest correlation.
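To make the probe-and-judge workflow concrete, the sketch below shows one way such a loop could be wired up. It is a minimal illustration under stated assumptions, not the paper's implementation: the function `llm_call` and the two prompt templates are hypothetical placeholders for whatever LLM client and prompts an evaluator actually uses.

```python
# Minimal sketch of a probe-and-judge loop for dynamic question generation.
# All names (llm_call, PROBE_PROMPT, JUDGE_PROMPT) are hypothetical; the
# paper's actual agents and prompts are not reproduced here.

def llm_call(prompt: str) -> str:
    """Placeholder for a call to any chat-completion API."""
    raise NotImplementedError("wire up your LLM client here")

PROBE_PROMPT = (
    "Rewrite the following question so that it tests the same knowledge "
    "with paraphrased wording and permuted answer options.\n\n{question}"
)
JUDGE_PROMPT = (
    "Original question:\n{original}\n\nRewritten question:\n{rewritten}\n\n"
    "Answer 'yes' if the rewritten question keeps the same correct answer "
    "and difficulty, otherwise answer 'no'."
)

def generate_probe(original_question: str, max_retries: int = 3) -> str | None:
    """Turn one benchmark item into a new evaluation sample.

    The probing agent proposes a rewrite; the judge agent checks that the
    rewrite stays consistent with the original. Rejected rewrites are retried.
    """
    for _ in range(max_retries):
        rewritten = llm_call(PROBE_PROMPT.format(question=original_question))
        verdict = llm_call(JUDGE_PROMPT.format(original=original_question,
                                               rewritten=rewritten))
        if verdict.strip().lower().startswith("yes"):
            return rewritten
    return None  # drop the item if the judge never accepts a rewrite
```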
Additionally, larger models exhibited stronger correlations between abilities, suggesting an implicit "Matthew effect." MPA can also serve as a data augmentation approach: MPA-generated data improved LLM performance by an average of about 2% on MMLU and ARC-C. An error analysis further identified recurring failure patterns related to question understanding, problem solving, and domain knowledge. The paper concludes that MPA is a promising, dynamic, and flexible framework for evaluating LLM abilities. The findings point to the need for further research on improving LLMs and on understanding their capabilities, and the authors emphasize that rigorous evaluation is essential for responsible AI development and for gauging models' true capabilities.
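The ability-correlation analysis summarized above amounts to computing pairwise Pearson correlations over per-model ability scores. The snippet below is a generic sketch of that computation; the ability names and score values are assumptions supplied by the caller, not numbers from the paper.

```python
# Pairwise Pearson correlations between ability scores, one score per model.
# Generic sketch of the analysis described above; the input data are
# placeholders provided by the caller, not results from the paper.
import numpy as np

def ability_correlations(scores: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Return Pearson r for every pair of abilities.

    `scores` maps an ability name (e.g. "language_understanding",
    "problem_solving", "domain_knowledge") to a list of scores, one per
    evaluated model, with all lists aligned in the same model order.
    """
    names = list(scores)
    correlations = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = np.corrcoef(scores[a], scores[b])[0, 1]
            correlations[(a, b)] = float(r)
    return correlations
```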