11 Mar 2024 | Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
InfiAgent-DABench is a benchmark designed to evaluate large language model (LLM)-based agents on data analysis tasks. It comprises DAEval, a dataset of 257 data analysis questions derived from 52 CSV files, and an agent framework that equips LLMs to act as data analysis agents, used both for serving and for evaluation. To address the challenge of evaluating open-ended data analysis questions without human supervision, a format-prompting technique converts each question into a closed-form version that can be graded automatically. The dataset is rigorously assessed by human experts along multiple dimensions, and all unqualified samples are filtered out. An evaluation of 34 LLMs on the benchmark reveals the challenges current models face in data analysis tasks. In addition, the work contributes DAInstruct, an instruction-tuning dataset for data analysis, and develops DAAgent, a specialized open-source data analysis agent that surpasses GPT-3.5 by 3.9% on DABench. Overall, the benchmark provides a comprehensive evaluation of LLM-based agents on data analysis tasks, highlighting both the challenges and potential improvements in this area.
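The format-prompting idea can be sketched as follows: append explicit, machine-checkable output-format requirements to an otherwise open-ended question, then parse the agent's final answer for automatic scoring. This is a minimal illustrative sketch, not the benchmark's actual implementation; the `@name[value]` answer convention and all function names here are assumptions.

```python
import re

def build_format_prompt(question: str, answer_names: list[str]) -> str:
    """Turn an open-ended question into a closed-form one by appending
    format requirements the grader can check mechanically."""
    requirements = "\n".join(
        f"- Report the result as @{name}[<value>]" for name in answer_names
    )
    return (
        f"{question}\n\n"
        f"Format your final answer exactly as follows:\n{requirements}"
    )

def parse_answers(response: str) -> dict[str, str]:
    """Extract @name[value] pairs from the agent's response for auto-grading."""
    return dict(re.findall(r"@(\w+)\[([^\]]*)\]", response))

# Example: a closed-form prompt and automatic extraction of the answer.
prompt = build_format_prompt("What is the mean age of passengers?", ["mean_age"])
answers = parse_answers("After computing, the result is @mean_age[29.70].")
# answers == {"mean_age": "29.70"}
```

Because the extracted values are plain strings keyed by name, they can be compared against gold answers without any human judging of free-form text, which is what makes large-scale evaluation of open-ended questions tractable.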