Large Language Models as Test Case Generators: Performance Evaluation and Enhancement


20 Apr 2024 | Kefan Li, Yuan Yuan
This paper evaluates the performance of Large Language Models (LLMs) in generating test cases and proposes a multi-agent framework called TestChain to enhance their effectiveness. LLMs have made significant progress in code generation, but their ability to generate high-quality test cases remains underexplored. The study finds that as problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases because of limitations in computation and reasoning. To address this, the authors propose TestChain, which decouples test input generation from test output generation and uses a ReAct-format conversation chain to interact with a Python interpreter, improving the accuracy of test outputs.

The experiments show that TestChain significantly outperforms the baseline in test case accuracy, particularly on the LeetCode-hard dataset, where TestChain with GPT-4 achieves a 13.84% improvement over the baseline. The framework also improves line coverage and reduces the number of incorrect test cases, especially those caused by assertion errors. These results indicate that interaction with the Python interpreter is crucial for enhancing the performance of LLMs in test case generation, and they underscore the importance of accurate test case generation for ensuring code quality and reliability.
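The interpreter interaction can be pictured as a small ReAct-style loop: the agent states a Thought, issues a Python snippet as an Action, reads the interpreter's Observation, and repeats until it commits to the expected output for a given test input. The sketch below is a minimal illustration of that loop under this reading of the paper; `query_llm` is a hypothetical stand-in for the actual model call and is not part of the authors' implementation.

```python
# Minimal sketch of a ReAct-style loop between an LLM agent and a Python
# interpreter for resolving a test input to its expected output.
# `query_llm` is a hypothetical placeholder for the real model call.
import io
import contextlib


def query_llm(conversation: list[str]) -> dict:
    """Hypothetical LLM call: returns a Thought plus either a Python
    'action' to execute or a 'final' expected output."""
    raise NotImplementedError("replace with a real model call")


def run_python(code: str) -> str:
    """Execute a code snippet and return whatever it prints (the Observation)."""
    buffer = io.StringIO()
    namespace: dict = {}
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # intentionally unsandboxed for brevity
    except Exception as exc:  # surface errors as observations too
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def compute_expected_output(function_source: str, test_input: str,
                            max_turns: int = 5) -> str:
    """Drive a Thought -> Action -> Observation chain until the agent
    commits to a final expected output for `test_input`."""
    conversation = [
        f"Function under test:\n{function_source}",
        f"Test input: {test_input}",
    ]
    for _ in range(max_turns):
        step = query_llm(conversation)
        conversation.append(f"Thought: {step['thought']}")
        if "final" in step:  # the agent is confident in its answer
            return step["final"]
        observation = run_python(step["action"])
        conversation.append(f"Observation: {observation}")
    raise RuntimeError("no final answer within the turn budget")
```

The key design choice this illustrates is that the expected output is obtained by actually executing code rather than by asking the model to compute it mentally, which is where the paper reports LLMs tend to fail on harder problems.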
While LLMs can generate a large number of correct test cases for relatively easy problems, they struggle with more complex ones. TestChain addresses this by decomposing test case generation into two sequential sub-tasks, test input generation and test output generation, with a dedicated agent for each (see the sketch below). This decomposition reduces the complexity of the input-output mapping the model must handle and improves the accuracy of the generated test cases. The paper also discusses the limitations of the framework, noting that it demands strong model capabilities and is tailored to robust models such as GPT-3.5 and GPT-4. Future work could explore ways to extend the TestChain paradigm to weaker models. Overall, the study demonstrates the potential of TestChain for generating test cases that are both accurate and reliable.
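Read that way, the two-agent pipeline could be sketched as an input-designer agent followed by an output-calculator agent, with the results assembled into assert-style test cases. The sketch below is illustrative only: `generate_test_inputs` is an assumed stand-in for the first agent's model call, and it reuses `compute_expected_output` from the previous sketch for the second agent.

```python
# Illustrative two-agent decomposition: a designer agent proposes test
# inputs, and each input is resolved to an expected output through the
# interpreter loop (compute_expected_output) sketched above.
def generate_test_inputs(problem_description: str, n: int = 5) -> list[str]:
    """Hypothetical designer agent: returns `n` candidate test inputs
    (as Python expressions) covering normal and edge cases."""
    raise NotImplementedError("replace with a real model call")


def build_test_cases(problem_description: str, function_source: str,
                     function_name: str) -> list[str]:
    """Assemble assert-style test cases by pairing generated inputs with
    interpreter-verified expected outputs."""
    test_cases = []
    for test_input in generate_test_inputs(problem_description):
        expected = compute_expected_output(function_source, test_input)
        test_cases.append(f"assert {function_name}({test_input}) == {expected}")
    return test_cases
```

Splitting the task this way means each agent solves a simpler problem: the first only has to propose diverse, valid inputs, while the correctness of the outputs is delegated to the interpreter rather than to the model's own reasoning.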