**AGENTBOARD: An Analytical Evaluation Board of Multi-Turn LLM Agents**
**Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He**
The University of Hong Kong, Zhejiang University, Shanghai Jiao Tong University, Tsinghua University, School of Engineering, Westlake University, The Hong Kong University of Science and Technology
**Abstract:**
Evaluating large language models (LLMs) as general-purpose agents is crucial for understanding their capabilities and integrating them into practical applications. However, the evaluation process faces significant challenges, particularly in benchmarking agent performance across diverse scenarios within a unified framework, maintaining partial observability, and ensuring multi-round interactions. Current evaluation frameworks often focus solely on final success rates, providing limited insights into the evaluation process. To address these challenges, we introduce AGENTBOARD, a comprehensive benchmark and open-source evaluation framework designed for the analytical evaluation of LLM agents. AGENTBOARD offers a fine-grained progress rate metric that captures incremental advancements and a comprehensive evaluation toolkit featuring interactive visualization for multifaceted analysis. This framework not only highlights the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Ultimately, AGENTBOARD aims to demystify agent behaviors and accelerate the development of stronger LLM agents.
**Introduction:**
General-purpose agents capable of independently perceiving and acting in diverse environments represent a significant milestone in Artificial Intelligence. Recent advancements in LLMs have demonstrated emergent agent abilities, such as understanding diverse environments and performing step-by-step planning through multi-round interactions. Comprehensive evaluation of LLM agents is therefore essential for advancing the field. Task diversity, multi-round interaction, and partial observability are critical criteria for effective evaluation, yet existing benchmarks often fall short of meeting them.
**AGENTBOARD Overview:**
AGENTBOARD is designed with uniformity and user-friendliness in mind. It comprises 9 diverse tasks and 1013 example environments, covering embodied, game, web, and tool agents. Each environment is crafted to be multi-round and partially observable. AGENTBOARD introduces a *progress rate* metric that tracks agents' incremental advancements, providing a more nuanced view of performance than the traditional success rate.
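To make the contrast with success rate concrete, the following is a minimal Python sketch of one way a progress rate could be computed, assuming each environment annotates a set of subgoal conditions and a trajectory is scored by the best subgoal match it reaches. The function names, data structures, and matching logic here are illustrative assumptions, not AGENTBOARD's actual implementation, which defines per-task matching functions.

```python
from typing import List, Set


def progress_rate(trajectory_states: List[Set[str]], subgoals: Set[str]) -> float:
    """Illustrative subgoal-based progress rate.

    trajectory_states: conditions satisfied after each interaction round.
    subgoals: annotated conditions that together define task completion.
    Returns the highest fraction of subgoals matched at any point in the trajectory.
    """
    if not subgoals or not trajectory_states:
        return 0.0
    best_match = max(len(state & subgoals) for state in trajectory_states)
    return best_match / len(subgoals)


def success(progress: float) -> bool:
    """A task counts as a success only when every subgoal is met."""
    return progress >= 1.0


# Example: an agent that completes 2 of 4 annotated subgoals scores 0 on
# success rate but 0.5 on progress rate, exposing partial competence.
states = [{"door_open"}, {"door_open", "key_taken"}]
goals = {"door_open", "key_taken", "chest_open", "gem_taken"}
print(progress_rate(states, goals))           # 0.5
print(success(progress_rate(states, goals)))  # False
```

As the toy example shows, two agents with identical (zero) success rates can have very different progress rates, which is the kind of fine-grained signal the metric is meant to surface.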
**Task Composition:**
AGENTBOARD consists of four types of environments: embodied, game, web, and tool. Each environment is designed to evaluate specific aspects of agent capabilities, such as navigation, planning, and tool usage. The progress rate metric is applied to assess agents' advancements in these tasks.
**Experiments:**
We evaluate popular LLMs, including proprietary and open-weight models, using AGENTBOARD. The results show that progress rates are more informative than success rates, highlighting the strengths and weaknesses of different models. Proprietary models generally outperform open-weight models, with GPT-4 demonstrating exceptional performance across tasks.