26 Jun 2024 | Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, Junjie Chen
This paper presents the first empirical study on unit test generation using large language models (LLMs). The study evaluates five open-source LLMs with different architectures and parameter sizes on 17 Java projects from the Defects4J benchmark. The research questions examine the influence of prompt design and in-context learning methods, and how the open-source LLMs compare with the commercial GPT-4 and the traditional search-based tool EvoSuite. Key findings include:
1. **Prompt Design**: The description style and the selected code features significantly impact the effectiveness of LLMs in unit test generation. A natural-language description of the task (NL) is more effective than a code-style description (CL) for some LLMs, while the choice of code features should balance the context the model needs for code comprehension against the prompt space left for the generated tests (see the prompt sketch after this list).
2. **In-Context Learning Methods**: Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG) do not consistently improve the effectiveness of LLM-based unit test generation. CoT helps mainly for LLMs with strong code-comprehension abilities, while RAG is less effective because of the gap between the retrieved unit tests and the tests that need to be generated (both variants are sketched after this list).
3. **Performance Comparison**: Larger-scale LLMs generally outperform smaller-scale ones, but the best LLM for unit test generation can vary with the specific task. The commercial GPT-4 generally performs better than the studied open-source LLMs, and the traditional search-based tool EvoSuite significantly outperforms all LLM-based techniques in terms of test coverage.
4. **Defect Detection**: The defect-detection ability of LLM-generated unit tests is limited by their low validity. On average, 87.13% of defects cannot be detected by the generated tests, and the remaining defects are detected with a high error rate of 47.28% (the detection criterion used here is sketched after this list).
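To make the prompt-design dimensions in finding 1 concrete, the sketch below builds a test-generation prompt from a focal method, switching between an NL-style and a CL-style task description and taking an optional set of code features. It is a minimal illustration in Python with hypothetical helper and parameter names (`build_prompt`, `class_signature`, `constructors`, `fields`), not the prompt format used in the paper.

```python
# Illustrative sketch (not the paper's implementation): building a test-generation
# prompt from a focal method, varying the description style and code features.

def build_prompt(focal_method: str,
                 class_signature: str = "",
                 fields: str = "",
                 constructors: str = "",
                 style: str = "NL") -> str:
    """Assemble a prompt for LLM-based unit test generation.

    style="NL" phrases the task as a natural-language instruction;
    style="CL" phrases it as a code comment, as contrasted in the study.
    Code features are optional: more context aids comprehension, but it
    consumes tokens that could otherwise be spent on the generated test.
    """
    # Selected code features, concatenated in a fixed order.
    context = "\n".join(part for part in (class_signature, fields, constructors) if part)

    if style == "NL":
        instruction = ("You are a Java developer. Write a JUnit test class "
                       "that covers the following focal method.")
        return f"{instruction}\n\n// Class context\n{context}\n\n// Focal method\n{focal_method}\n"

    # style == "CL": the task is stated only as a trailing code comment.
    return (f"// Class context\n{context}\n\n"
            f"{focal_method}\n"
            "// JUnit test class for the method above:\n")


# Example usage with a toy focal method.
prompt = build_prompt(
    focal_method="public int add(int a, int b) { return a + b; }",
    class_signature="public class Calculator",
    style="NL",
)
print(prompt)
```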
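The two in-context learning variants in finding 2 can be viewed as prompt transformations. The snippet below is a hypothetical sketch, not the study's implementation: the CoT variant appends a step-by-step reasoning instruction, and the RAG variant retrieves the lexically most similar existing test (plain token-overlap Jaccard similarity here, standing in for whatever retriever is actually used) and prepends it as a one-shot example.

```python
# Minimal sketches of the two in-context learning variants discussed in the study.
# Illustrative only; the retriever and the wording are assumptions.

def with_chain_of_thought(prompt: str) -> str:
    """CoT: ask the model to reason about the method before writing tests."""
    return (prompt
            + "\nFirst, reason step by step about the method's inputs, outputs, "
              "and edge cases. Then write the JUnit test class.\n")


def retrieve_example_test(focal_method: str, test_corpus: list[str]) -> str:
    """RAG: pick the existing test with the highest token overlap (Jaccard)."""
    query = set(focal_method.split())

    def jaccard(test: str) -> float:
        tokens = set(test.split())
        return len(query & tokens) / max(len(query | tokens), 1)

    return max(test_corpus, key=jaccard, default="")


def with_retrieved_example(prompt: str, focal_method: str, test_corpus: list[str]) -> str:
    """Prepend the retrieved test as a one-shot example."""
    example = retrieve_example_test(focal_method, test_corpus)
    return f"// Example test retrieved from the project\n{example}\n\n{prompt}"
```

The sketch also hints at why RAG can underperform: when the project's existing tests differ substantially from the tests that need to be written, the retrieved example consumes prompt space without providing useful guidance.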
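Finding 4 rests on the usual Defects4J criterion for defect detection: a generated test counts only if it is valid (it compiles and passes on the fixed program) and it fails on the buggy program. The sketch below encodes that classification; it is an illustration with assumed names, not the study's evaluation harness.

```python
# Classifying a generated test against a Defects4J bug: a test is defect-detecting
# only if it is valid (passes on the fixed program) and fails on the buggy program.
# Illustrative sketch, not the study's harness.

from dataclasses import dataclass


@dataclass
class TestRunResult:
    compiles: bool
    passes_on_fixed: bool
    passes_on_buggy: bool


def classify(result: TestRunResult) -> str:
    if not result.compiles:
        return "invalid: compilation error"
    if not result.passes_on_fixed:
        return "invalid: fails on the fixed version (likely a wrong assertion)"
    if result.passes_on_buggy:
        return "valid but misses the defect"
    return "valid and detects the defect"


# Example: compiles, passes on the fixed version, fails on the buggy one.
print(classify(TestRunResult(compiles=True, passes_on_fixed=True, passes_on_buggy=False)))
# -> "valid and detects the defect"
```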
The study provides actionable guidelines for future research and practical use, emphasizing the importance of tuning prompt design, selecting appropriate LLMs, and addressing the limitations of LLM-based unit test generation.