DebugBench: Evaluating Debugging Capability of Large Language Models

6 Jun 2024 | Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun
The paper "DebugBench: Evaluating Debugging Capability of Large Language Models" by Runchu Tian introduces DebugBench, a new benchmark designed to evaluate the debugging capabilities of large language models (LLMs). The benchmark consists of 4,253 instances covering four major bug categories and 18 minor types in C++, Java, and Python. The authors collect code snippets from LeetCode, implant bugs using GPT-4, and ensure rigorous quality checks. They evaluate two commercial models (gpt-4-0613 and gpt-3.5-turbo-0613) and four open-source models in a zero-shot scenario. Key findings include: 1. **Performance Compared to Humans**: Closed-source models perform worse than humans, while open-source models have lower pass rates. 2. **Bug Type Challenges**: Multiple and logical errors are more challenging to fix than syntax and reference errors. 3. **Runtime Feedback Impact**: Runtime feedback improves performance for syntax and reference errors but is unhelpful for logical errors. 4. **Correlation Between Debugging and Coding**: For closed-source models, debugging syntax or reference errors is easier than code generation, while logic or multiple errors can be equally hard or harder. The study also highlights the need for more practical and complex debugging scenarios and further research on how LLMs interact with Integrated Development Environments (IDEs). The data and code are open-sourced via GitHub and Hugging Face.The paper "DebugBench: Evaluating Debugging Capability of Large Language Models" by Runchu Tian introduces DebugBench, a new benchmark designed to evaluate the debugging capabilities of large language models (LLMs). The benchmark consists of 4,253 instances covering four major bug categories and 18 minor types in C++, Java, and Python. The authors collect code snippets from LeetCode, implant bugs using GPT-4, and ensure rigorous quality checks. They evaluate two commercial models (gpt-4-0613 and gpt-3.5-turbo-0613) and four open-source models in a zero-shot scenario. Key findings include: 1. **Performance Compared to Humans**: Closed-source models perform worse than humans, while open-source models have lower pass rates. 2. **Bug Type Challenges**: Multiple and logical errors are more challenging to fix than syntax and reference errors. 3. **Runtime Feedback Impact**: Runtime feedback improves performance for syntax and reference errors but is unhelpful for logical errors. 4. **Correlation Between Debugging and Coding**: For closed-source models, debugging syntax or reference errors is easier than code generation, while logic or multiple errors can be equally hard or harder. The study also highlights the need for more practical and complex debugging scenarios and further research on how LLMs interact with Integrated Development Environments (IDEs). The data and code are open-sourced via GitHub and Hugging Face.