DebugBench: Evaluating Debugging Capability of Large Language Models

6 Jun 2024 | Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun
**Summary:** This paper introduces DebugBench, a benchmark for evaluating the debugging capabilities of large language models (LLMs). The benchmark comprises 4,253 instances spanning four major bug categories and 18 minor types in C++, Java, and Python. It was constructed by collecting code snippets from LeetCode, implanting bugs with GPT-4, and applying rigorous quality checks. Two closed-source models (gpt-3.5-turbo-0613 and gpt-4-0613) and four open-source models (CodeLlama-7b-Instruct, Llama-3-8B-Instruct, DeepSeek-Coder-33B-Instruct, and Mixtral-8x7B-Instruct) were evaluated in a zero-shot setting. Key findings include: (1) closed-source models debug below human level, while open-source models achieve even lower pass rates; (2) debugging difficulty varies by bug type, with logic and multiple errors being the most challenging; (3) runtime feedback improves performance on syntax and reference errors but does not help with logic errors. In addition, debugging and code-generation performance are positively correlated for closed-source models. The study highlights the need for further research to improve LLM debugging, particularly on complex errors. The dataset and code are open-sourced for research and development.
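As a rough illustration of the zero-shot evaluation protocol summarized above, the sketch below walks a DebugBench-style instance through a debugging prompt, a model call, and a LeetCode-style test judge, reporting the pass rate. The field names (`language`, `bug_type`, `buggy_code`), the prompt wording, and the `query_model` / `run_leetcode_tests` helpers are illustrative assumptions, not the authors' actual interface.

```python
# Minimal sketch of a zero-shot debugging evaluation loop in the style of
# DebugBench. Field names and helper functions are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class DebugInstance:
    language: str      # "cpp", "java", or "python3"
    bug_type: str      # e.g. "syntax", "reference", "logic", "multiple"
    buggy_code: str    # LeetCode solution with an implanted bug

# Hypothetical zero-shot prompt template (not the paper's exact wording).
ZERO_SHOT_PROMPT = (
    "The following {language} solution contains a bug ({bug_type}). "
    "Return the corrected code only.\n\n{buggy_code}"
)

def evaluate(instances, query_model, run_leetcode_tests):
    """Return the pass rate: fraction of repairs that pass all hidden tests."""
    passed = 0
    for inst in instances:
        prompt = ZERO_SHOT_PROMPT.format(
            language=inst.language,
            bug_type=inst.bug_type,
            buggy_code=inst.buggy_code,
        )
        fixed_code = query_model(prompt)          # LLM proposes a repair
        if run_leetcode_tests(fixed_code, inst):  # judge against hidden tests
            passed += 1
    return passed / len(instances)
```

In this framing, `query_model` would wrap whichever closed- or open-source model is under test, and `run_leetcode_tests` would submit the repaired code to the online judge, mirroring the pass-rate metric used in the paper.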