CODER: ISSUE RESOLVING WITH MULTI-AGENT AND TASK GRAPHS

11 Jun 2024 | Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, Qianxiang Wang
**Abstract:** GitHub issue resolving has gained significant attention from both academia and industry. SWE-bench is a benchmark used to measure the performance of issue-resolving systems. This paper introduces CODER, a multi-agent framework that uses pre-defined task graphs to repair and resolve reported bugs and to add new features within a code repository. On SWE-bench lite, CODER achieves a 28.33% resolution rate with only one submission per issue. The paper also examines the performance impact of each design component of CODER and offers insights for advancing this research direction.

**Introduction:** The rapid advancement of Large Language Models (LLMs) has reshaped many industries, including software engineering. Issue resolving is a critical software-engineering task that LLMs can address effectively. SWE-bench collects real-world issues from popular Python libraries and tasks LLMs with resolving each issue given its description and the entire repository. The task is challenging because it demands deep reasoning over a large amount of code under incomplete information.

**Contributions:**
1. **CODER Framework:** A multi-agent framework with task graphs for issue resolving. The framework defines the roles Manager, Reproducer, Fault Localizer, Editor, and Verifier, each with specific responsibilities.
2. **Task Graphs:** A data structure that ensures precise execution of pre-designed plans and offers an easy-to-plug interface for injecting plans written by humans (see the first sketch after this list).
3. **LLM-Generated Code:** Uses LLM-generated code to reproduce issues and run tests, improving context retrieval and fault localization (see the second sketch after this list).
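To make the role-and-task-graph design concrete, here is a minimal Python sketch, not the paper's implementation: tasks are graph nodes assigned to agent roles, edges are dependencies, and a topological walk executes the plan in order. Every name here (`Task`, `run_plan`, the handler stubs) is illustrative; in CODER the Manager would choose the plan and each role would drive an LLM rather than print.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    agent: str                            # role that handles this task
    deps: list[str] = field(default_factory=list)

def run_plan(tasks, handlers):
    """Walk the graph in dependency order, dispatching each task to its role."""
    done: set[str] = set()
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if all(d in done for d in t.deps)]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in task graph")
        for t in ready:
            handlers[t.agent](t)          # hand the task to its agent role
            done.add(t.name)
            pending.remove(t)

# A plan mirroring the roles named above; handlers are print stubs here.
plan = [
    Task("reproduce_issue", "Reproducer"),
    Task("localize_fault", "Fault Localizer", deps=["reproduce_issue"]),
    Task("edit_code", "Editor", deps=["localize_fault"]),
    Task("verify_fix", "Verifier", deps=["edit_code"]),
]
handlers = {role: (lambda t: print(f"[{t.agent}] {t.name}"))
            for role in ("Reproducer", "Fault Localizer", "Editor", "Verifier")}
run_plan(plan, handlers)
```

Because the plan is data rather than something the LLM must invent on the fly, a human-written graph can be plugged in directly, which is the "easy-to-plug interface" the paper describes.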
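The third contribution can be sketched the same way: an executable reproduction script acts as an oracle, failing while the bug is present and passing once a fix lands. A minimal sketch, assuming the LLM has already produced the script text (`issue_reproduces` is a hypothetical helper, not CODER's API):

```python
import subprocess
import sys
import tempfile

def issue_reproduces(script_source: str) -> bool:
    """Run a reproduction script; a nonzero exit code means the bug persists."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script_source)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return result.returncode != 0

# Illustrative stand-in for an LLM-generated script: the assertion fails
# (floating-point "bug"), so the script exits nonzero.
repro_script = "assert (0.1 + 0.2) == 0.3\n"
print("issue reproduces:", issue_reproduces(repro_script))  # -> True
```

Rerunning the same script after the Editor's patch is applied tells the Verifier whether the fix actually worked.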
**Experimental Setup:**
- **Benchmarks:** SWE-bench lite, a 300-instance subset of SWE-bench focused on functional bug fixes.
- **Metrics:** Resolved%, Average Requests, and Average Tokens/Cost (reduced to simple averages in the sketch after the conclusion).
- **Comparative Methods:** CODER is compared against a range of approaches, including retrieval-based and implicit patch-generation methods.

**Results:** CODER achieves a 28.33% resolution rate on SWE-bench lite, outperforming the compared methods. Its effectiveness is attributed to the well-designed roles and actions, together with task graphs that make planning and execution precise.

**Conclusion:** CODER demonstrates the value of giving LLMs human-like problem-solving procedures for issue resolving. Its pre-specified task graphs simplify the planning burden on LLMs and ensure accurate plan execution. Future work will focus on building a comprehensive set of plans to resolve more issues.
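For concreteness, the three metrics listed under Experimental Setup reduce to simple averages over per-instance bookkeeping. A minimal sketch with made-up placeholder records (none of these numbers come from the paper):

```python
# Hypothetical per-instance records, one per SWE-bench lite issue attempt.
runs = [
    {"resolved": True,  "requests": 14, "cost_usd": 0.31},
    {"resolved": False, "requests": 22, "cost_usd": 0.55},
    {"resolved": True,  "requests": 9,  "cost_usd": 0.18},
]

n = len(runs)
resolved_pct = 100.0 * sum(r["resolved"] for r in runs) / n
avg_requests = sum(r["requests"] for r in runs) / n
avg_cost = sum(r["cost_usd"] for r in runs) / n
print(f"Resolved%: {resolved_pct:.2f}  "
      f"Avg requests: {avg_requests:.1f}  Avg cost: ${avg_cost:.2f}")
```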