Code Agents are State of the Art Software Testers

2024 | Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev
The paper introduces SWT-BENCH, a benchmark for test generation built from real-world software repositories, user issues, code patches, and test cases. It is created by transforming the popular SWE-BENCH dataset from a code-repair task into a test-generation task. On SWT-BENCH, the authors evaluate a range of test generation methods, including state-of-the-art LLM-based approaches and Code Agents. Code Agents, particularly SWE-AGENT, outperform the other methods, producing tests with higher coverage and accuracy.
The paper also demonstrates that generated tests can serve as a strong signal for the correctness of proposed code fixes: SWE-AGENT achieves twice the precision on fixes that pass self-generated tests that failed before the fix was applied. The authors highlight the potential of Code Agents for test generation and suggest that test-specific agents could yield further improvements. They also discuss limitations of the benchmark, including its focus on Python and potential selection biases. Overall, the paper shows that Code Agents are effective test generators and can significantly improve the quality of software testing.
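The fail-to-pass signal described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-test pass/fail results are already collected (e.g., from a test runner) before and after applying a candidate fix, and the function names are hypothetical.

```python
def fail_to_pass(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Return the generated tests that failed before the fix and pass after it.

    `before` and `after` map test names to a pass (True) / fail (False) result
    from running the same generated test suite without and with the fix applied.
    """
    return {name for name, passed in after.items()
            if passed and not before.get(name, False)}


def fix_supported(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """A candidate fix is supported if at least one test goes fail -> pass,
    i.e., the test reproduces the issue and the fix resolves it."""
    return bool(fail_to_pass(before, after))
```

A fix that merely leaves all tests passing is weaker evidence than one that flips a previously failing, issue-reproducing test, which is why this filter roughly doubles precision in the paper's evaluation.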
[slides and audio] SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents