Code Agents are State of the Art Software Testers


18 Jun 2024 | Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev
The paper "Code Agents are State of the Art Software Testers" by Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev explores the potential of Large Language Models (LLMs) for generating test cases in software development. The authors address a gap in test-generation research by proposing a novel benchmark, SWT-BENCH, which is built from real-world GitHub repositories and contains user issues, ground-truth patches, and golden tests. They find that LLMs, particularly Code Agents designed for code repair, perform surprisingly well at generating relevant test cases, outperforming systems specifically designed for test generation. The evaluation relies on two metrics, fail-to-pass rate and code coverage, providing a dual perspective on the effectiveness of each approach.
The paper also highlights the complementary nature of the different methods and the potential of generated tests to filter proposed code fixes, doubling the precision of SWE-AGENT. The key contributions are the introduction of SWT-BENCH, the adaptation of Code Agents for test generation, and an extensive evaluation demonstrating the superior performance of Code Agents on test-generation tasks.
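The fail-to-pass criterion mentioned above can be illustrated with a minimal sketch: a generated test suite counts as successful if at least one test fails on the original buggy code and the whole suite passes once the ground-truth patch is applied. The function names and result dictionaries below are hypothetical, not taken from the SWT-BENCH codebase.

```python
# Minimal sketch of a fail-to-pass check, assuming per-test pass/fail
# results are available before and after applying the ground-truth patch.
# Names and data structures are illustrative only.

def fail_to_pass(results_before: dict[str, bool], results_after: dict[str, bool]) -> bool:
    """True if some test fails pre-patch and every test passes post-patch."""
    some_fail_before = any(not passed for passed in results_before.values())
    all_pass_after = all(results_after.values())
    return some_fail_before and all_pass_after

def fail_to_pass_rate(instances: list[tuple[dict, dict]]) -> float:
    """Fraction of benchmark instances whose generated tests are fail-to-pass."""
    hits = sum(fail_to_pass(before, after) for before, after in instances)
    return hits / len(instances)

# One instance where the generated test reproduces the bug ...
reproduces = ({"test_issue": False}, {"test_issue": True})
# ... and one where it passes even on the buggy code (does not reproduce it).
vacuous = ({"test_issue": True}, {"test_issue": True})
print(fail_to_pass_rate([reproduces, vacuous]))  # → 0.5
```

The same predicate suggests how generated tests can filter candidate fixes: a proposed patch is kept only if it turns the generated test from failing to passing, which is how discarding spurious patches can raise a repair system's precision.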
Understanding SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents