DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

2024 | Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, Jun Wang
This paper introduces DS-Agent, a framework that combines large language models (LLMs) with case-based reasoning (CBR) to automate data science tasks. The goal of automated data science is to understand task requirements, then build, train, and deploy the best-fit machine learning (ML) models. Existing LLM agents often generate unreasonable experiment plans for such tasks; DS-Agent addresses this limitation by integrating CBR into the agent. In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline: it draws on expert knowledge from Kaggle to develop experiment plans, then iteratively adjusts those plans based on execution feedback, which supports consistent performance improvement. In the deployment stage, DS-Agent uses a simplified CBR paradigm that adapts past successful solutions for code generation, significantly reducing the demand on the LLM's foundational capabilities. Empirically, DS-Agent achieves a 100% success rate in the development stage with GPT-4 and, in the deployment stage, a 36% average improvement in one-pass rate across alternative LLMs. It also ranks best in performance in both stages, at a cost of $1.60 per run in development and $0.13 per run in deployment with GPT-4. The framework is open-sourced at <https://github.com/guosyjlu/DS-Agent>.

The paper focuses on machine learning as the target of LLM-driven automation. Its introduction motivates the importance of data science tasks, describes the limitations of current LLM agents, and argues that CBR can improve their performance. It then details the development stage, covering the collection of human insights from Kaggle, the automatic iteration pipeline, and the iterative adjustment of experiment plans, followed by the deployment stage, which uses the simplified CBR framework to adapt past solutions for code generation.
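The development-stage loop described above (retrieve relevant insights, turn them into an experiment plan, execute, and adjust based on feedback) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Case` schema, the token-overlap retriever, and the `plan_fn` and `run_fn` callbacks are all hypothetical stand-ins, where a real system would presumably use an embedding-based retriever and LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Case:
    """One stored insight, e.g. a summarized Kaggle solution (hypothetical schema)."""
    description: str
    insight: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for an embedding-based retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(task: str, case_bank: List[Case], feedback: str = "") -> Case:
    """Return the stored case most similar to the task plus the latest feedback."""
    query = f"{task} {feedback}".strip()
    return max(case_bank, key=lambda c: jaccard(query, c.description))


def develop(
    task: str,
    case_bank: List[Case],
    plan_fn: Callable[[str, Case, str], str],      # assumed: wraps an LLM planner
    run_fn: Callable[[str], Tuple[float, str]],    # assumed: runs code, returns (score, feedback)
    n_iters: int = 3,
):
    """Iterate retrieve -> plan -> execute, keeping the best-scoring plan."""
    best_plan, best_score, feedback = None, float("-inf"), ""
    history = []
    for _ in range(n_iters):
        case = retrieve(task, case_bank, feedback)   # retrieve a relevant insight
        plan = plan_fn(task, case, feedback)         # reuse it to draft a plan
        score, feedback = run_fn(plan)               # execute and collect feedback
        history.append((plan, score))
        if score > best_score:                       # keep the best plan seen so far
            best_plan, best_score = plan, score
    return best_plan, best_score, history
```

Feeding the execution feedback back into the retrieval query is one simple way to let later iterations surface different cases; how DS-Agent actually conditions retrieval and planning on feedback is described in the paper, not here.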
The experimental results demonstrate the effectiveness of DS-Agent in both stages, with significant improvements over baseline agents. The paper concludes by discussing potential ethical concerns and directions for future work.
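The deployment-stage idea, adapting a past successful solution rather than generating code from scratch, can be sketched in the same spirit. The example below is an illustration under stated assumptions: `SolvedCase`, `retrieve_solved_case`, and `build_adaptation_prompt` are hypothetical names, the similarity measure is plain token overlap, and the prompt wording is invented for the example. The resulting prompt would be sent to an LLM of the reader's choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SolvedCase:
    """A past task paired with code that solved it (hypothetical schema)."""
    task: str
    solution_code: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase tokens; a stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_solved_case(new_task: str, cases: List[SolvedCase]) -> SolvedCase:
    """Pick the past case whose task description best matches the new task."""
    return max(cases, key=lambda c: token_overlap(new_task, c.task))


def build_adaptation_prompt(new_task: str, case: SolvedCase) -> str:
    """Assemble a prompt asking the LLM to adapt, not invent, a solution."""
    return (
        "You are given a past task and the code that solved it.\n"
        f"Past task: {case.task}\n"
        f"Past solution:\n{case.solution_code}\n"
        f"New task: {new_task}\n"
        "Adapt the past solution to the new task and return complete code."
    )
```

Because the model only has to edit a working example, this kind of single-pass adaptation asks less of the LLM than open-ended planning, which is consistent with the summary's claim that the deployment stage reduces the demand on foundational capabilities.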