COVERUP: Coverage-Guided LLM-Based Test Generation


Juan Altmayer Pizzorno, Emery D. Berger
COVERUP is a novel system that generates high-coverage Python regression tests by combining coverage analysis with large language models (LLMs). It iteratively improves coverage by focusing the LLM on code that lacks coverage and refining its prompts based on measured coverage data. Compared to CODAMOSA, a hybrid LLM/search-based testing system, COVERUP achieves significantly higher coverage: median line coverage of 81% (vs. 62%), branch coverage of 53% (vs. 35%), and line+branch coverage of 78% (vs. 55%). COVERUP's iterative, coverage-guided approach is crucial to its effectiveness, contributing to nearly half of its successes.

COVERUP prompts the LLM for tests targeting code segments that lack coverage, using a chat interface to refine prompts as needed. It executes the generated tests, measures their coverage, and continues the dialogue with the LLM when tests fail or do not improve coverage. It also checks for integration issues, handles flaky tests by repeating them and adjusting prompts, resolves module dependencies, and disables or isolates problematic tests.

Evaluated on the CODAMOSA benchmark suite, COVERUP outperforms both CODAMOSA (codex) and CODAMOSA (gpt4), achieving higher line, branch, and combined line+branch coverage across the entire benchmark suite and on a per-module basis. It also performs well on code that Pynguin already handles effectively.

COVERUP is available as open-source code and is designed to work with OpenAI models, with plans to adapt it for use with other models. The paper concludes that coupling coverage information with prompting, and iteratively refining prompts based on updated coverage information, is effective for generating high-coverage test suites.
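To make this workflow concrete, the sketch below shows one way such a coverage-guided generation loop could be structured, following the description above. It is not COVERUP's actual implementation: the prompt wording, the retry budget (MAX_ATTEMPTS), the file names, and the use of coverage.py together with the OpenAI chat API (model "gpt-4") are illustrative assumptions. Only the overall loop, which measures coverage, prompts the LLM about uncovered lines, runs the candidate test, and continues the chat with failure or coverage feedback, mirrors what the summary describes.

```python
"""Minimal sketch of a CoverUp-style, coverage-guided test-generation loop.

This is NOT CoverUp's actual implementation: the helper names, prompt text,
retry budget, and the choice of coverage.py are assumptions for illustration.
"""
import json
import subprocess
from pathlib import Path

from openai import OpenAI  # CoverUp is designed to work with OpenAI models

client = OpenAI()
MAX_ATTEMPTS = 3  # hypothetical per-module retry budget


def measure_coverage() -> dict:
    """Run the suite under coverage.py with branch coverage; return the JSON report."""
    subprocess.run(
        ["python", "-m", "coverage", "run", "--branch", "-m", "pytest", "tests"],
        check=False,
    )
    subprocess.run(["python", "-m", "coverage", "json", "-o", "coverage.json"], check=True)
    return json.loads(Path("coverage.json").read_text())


def build_prompt(filename: str, missing_lines: list[int]) -> str:
    """Focus the model on the lines the existing suite never executes."""
    source = Path(filename).read_text()
    return (
        f"The Python module below has lines {missing_lines} that no test executes:\n\n"
        f"{source}\n\n"
        "Write a pytest test for this module that causes those lines to run."
    )


def run_candidate(test_code: str) -> tuple[bool, str]:
    """Execute one candidate test in isolation; return (passed, combined output)."""
    candidate = Path("tests/test_llm_candidate.py")
    candidate.write_text(test_code)
    proc = subprocess.run(
        ["python", "-m", "pytest", str(candidate)], capture_output=True, text=True
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def generate_for_module(filename: str, missing_lines: list[int]) -> None:
    """Chat with the model until a test covers new lines or the budget runs out."""
    messages = [{"role": "user", "content": build_prompt(filename, missing_lines)}]
    for _ in range(MAX_ATTEMPTS):
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = reply.choices[0].message.content
        # Drop any Markdown fence lines the model may wrap its answer in.
        fence = "`" * 3
        test_code = "\n".join(
            line for line in answer.splitlines() if not line.startswith(fence)
        )
        passed, output = run_candidate(test_code)
        still_missing = measure_coverage()["files"][filename]["missing_lines"]
        if passed and set(still_missing) < set(missing_lines):
            return  # keep the test: it passes and covers previously missed lines
        # Otherwise continue the dialogue: report the failure or the lines that
        # remain uncovered so the model can refine its next attempt.
        messages.append({"role": "assistant", "content": answer})
        messages.append({
            "role": "user",
            "content": (
                f"That test did not improve coverage.\nTest output:\n{output}\n"
                f"Lines still uncovered: {still_missing}\nPlease revise the test."
            ),
        })


if __name__ == "__main__":
    report = measure_coverage()
    for fname, data in report["files"].items():
        if data["missing_lines"]:
            generate_for_module(fname, data["missing_lines"])
```

The design point this sketch illustrates is that each round of the chat tells the model exactly which lines still lack coverage, so the prompt is refined from measured data rather than simply repeated.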