2024-03-11 | Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
In this paper, the authors introduce InfiAgent-DABench, a benchmark for evaluating large language model (LLM)-based agents on data analysis tasks. The benchmark comprises DAEval, a dataset of 257 data analysis questions derived from 52 CSV files, and an agent framework that wraps LLMs so they can act as data analysis agents. Because data analysis questions are often open-ended and hard to evaluate, the authors apply a format-prompting technique that converts each question into a closed-form version whose answer can be checked automatically. Benchmarking 34 LLMs exposes the challenges these models still face on data analysis tasks. The authors also develop DAAgent, a specialized agent that outperforms GPT-3.5 by 3.9%. The evaluation datasets and toolkits for InfiAgent-DABench are released on GitHub. The paper further details the construction of the dataset, the agent framework, and the evaluation process, highlighting the importance of human assessment and the limitations of current LLMs in handling data analysis tasks.
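To illustrate the closed-form evaluation idea, below is a minimal sketch of how format-prompted answers might be parsed and scored automatically. The `@answer_name[value]` convention and the `parse_answers`/`score` helpers are illustrative assumptions, not the paper's released toolkit.

```python
import re

# Hypothetical convention (for illustration): the format prompt instructs the
# model to end its response with lines like "@mean_age[34.2]", one per sub-question.
ANSWER_PATTERN = re.compile(r"@(\w+)\[([^\]]*)\]")

def parse_answers(response: str) -> dict:
    """Extract {answer_name: value} pairs from a model response."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(response)}

def score(response: str, gold: dict, tol: float = 1e-6) -> float:
    """Fraction of gold answers the response matches; numeric values are
    compared within a tolerance, everything else as exact strings."""
    predicted = parse_answers(response)
    correct = 0
    for name, expected in gold.items():
        got = predicted.get(name)
        if got is None:
            continue
        try:
            if abs(float(got) - float(expected)) <= tol:
                correct += 1
        except ValueError:
            if got == expected:
                correct += 1
    return correct / len(gold)

# Example: a closed-form question asking for the mean and median of a column.
model_output = "The mean is 34.2 and the median is 31.\n@mean_age[34.2]\n@median_age[31]"
print(score(model_output, {"mean_age": "34.2", "median_age": "31"}))  # 1.0
```

Constraining answers to a machine-parseable format like this is what lets the benchmark grade open-ended analysis questions without human judges in the loop.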