April 2024 | Kaibo Liu, Yiyang Liu, Zhenpeng Chen, Jie M. Zhang, Yudong Han, Yun Ma, Ge Li, Gang Huang
The paper "LLM-Powered Test Case Generation for Detecting Tricky Bugs" by Kaibo Liu addresses the challenge of generating effective test cases to detect tricky bugs in programs, particularly those that have passed existing tests. Conventional automated test generation tools struggle with this task, often producing low-precision test oracles. To address this, the paper introduces AID (Automated Inference with Differential Testing), a method that combines Large Language Models (LLMs) with differential testing to generate fault-revealing test inputs and oracles for plausibly correct programs.
AID consists of three main components:
1. **PUT-guided program generation**: Utilizes the program under test (PUT) and its specification to generate program variants.
2. **Generator-based input generation**: Uses LLMs to write input generators that produce legal, specification-conforming test inputs.
3. **Diversity-first differential testing**: Prioritizes test inputs that elicit diverse outputs across the generated variants to construct more accurate test oracles (components 2 and 3 are illustrated in the sketch after this list).
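The sketch below is a minimal, hypothetical illustration of how components 2 and 3 could fit together: an LLM-written input generator supplies legal inputs, and diversity-first differential testing ranks those inputs by how much the program variants disagree, using the agreeing majority output as a candidate oracle. The names (`gen_input`, `run_program`, `build_test_cases`) and details such as majority voting are simplifying assumptions for illustration, not AID's actual implementation.

```python
import random
from collections import Counter

# Component 2 (illustrative): an LLM-written input generator is itself a small
# program that samples legal inputs for the PUT. Here the "specification" is
# simply a non-negative integer, which is a made-up example.
def gen_input():
    return random.randint(0, 100)

def run_program(program, test_input):
    """Run a candidate program (a plain Python callable here) on one input.
    Exceptions map to a sentinel so crashing variants still take part in the
    diversity comparison. (Hypothetical helper, not from the paper.)"""
    try:
        return repr(program(test_input))
    except Exception as exc:
        return f"<error: {type(exc).__name__}>"

def build_test_cases(put, variants, inputs, budget=10):
    """Component 3 (illustrative): diversity-first differential testing.
    Rank inputs by how many distinct outputs the variants produce, then take
    the majority variant output as the candidate oracle; an input becomes a
    fault-revealing test case if the PUT disagrees with that oracle."""
    scored = []
    for x in inputs:
        variant_outputs = [run_program(v, x) for v in variants]
        scored.append((len(set(variant_outputs)), x, variant_outputs))

    # Most diverse outputs first.
    scored.sort(key=lambda item: item[0], reverse=True)

    test_cases = []
    for _, x, variant_outputs in scored[:budget]:
        oracle, _ = Counter(variant_outputs).most_common(1)[0]
        if run_program(put, x) != oracle:
            test_cases.append((x, oracle))  # candidate bug-revealing test
    return test_cases

# Toy usage: a PUT with an off-by-one bug, plus two correct LLM-style variants.
put = lambda n: sum(range(n))                      # buggy: omits n itself
variants = [lambda n: n * (n + 1) // 2, lambda n: sum(range(n + 1))]
inputs = [gen_input() for _ in range(50)]
print(build_test_cases(put, variants, inputs)[:3])
```

In AID itself, the variants are LLM-generated from the PUT and its specification (component 1), and oracle construction is more involved than the simple majority vote used here.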
The evaluation of AID on two large-scale datasets, TrickyBugs and EvalPlus, shows significant improvements over state-of-the-art baselines in terms of recall, precision, and F1 score. AID outperforms the best baseline by up to 1.80×, 2.65×, and 1.66×, respectively. The paper also includes an extensive ablation study to demonstrate the effectiveness of each component of AID.
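For reference, the F1 score combines the other two metrics: it is the harmonic mean of precision and recall, so AID's gains in F1 reflect simultaneous improvement in both.

```python
def f1(precision: float, recall: float) -> float:
    """Standard F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```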
The contributions of the paper include:
- A novel LLM-powered approach for generating fault-revealing test inputs and high-precision test oracles.
- An extensive evaluation on two comprehensive datasets, TrickyBugs and EvalPlus, including an ablation study of each component.
- A replication package to facilitate future research.
The paper concludes by highlighting the practical value of AID in generating defect-identifying test cases for plausibly correct programs, making it a significant advancement in automated test generation.