Large Language Models as Test Case Generators: Performance Evaluation and Enhancement


20 Apr 2024 | Kefan Li, Yuan Yuan
This paper evaluates the performance of Large Language Models (LLMs) in generating test cases and proposes a multi-agent framework called TestChain to enhance their effectiveness. LLMs have made significant progress in code generation, but their ability to generate high-quality test cases remains underexplored. The study finds that as problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases because of limitations in computation and reasoning. To address this, the authors propose TestChain, which decouples test input generation from test output generation and uses a ReAct-format conversation chain to interact with a Python interpreter, improving the accuracy of test outputs.

The experiments show that TestChain significantly outperforms the baseline in test case accuracy, particularly on the LeetCode-hard dataset, where TestChain with GPT-4 achieves a 13.84% improvement over the baseline. The framework also improves line coverage and reduces the number of incorrect test cases, especially those caused by assertion errors. These results indicate that interaction with the Python interpreter is crucial for enhancing the performance of LLMs in test case generation, and they underscore the importance of accurate test case generation for ensuring code quality and reliability.
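The interpreter interaction can be pictured as a small ReAct-style loop: the agent states a Thought, issues a Python snippet as an Action, reads the interpreter's Observation, and repeats until it commits to the expected output for a given test input. The sketch below is a minimal illustration of that loop under this reading of the paper; `query_llm` is a hypothetical stand-in for the actual model call and is not part of the authors' implementation.

```python
# Minimal sketch of a ReAct-style loop between an LLM agent and a Python
# interpreter for resolving a test input to its expected output.
# `query_llm` is a hypothetical placeholder for the real model call.
import io
import contextlib


def query_llm(conversation: list[str]) -> dict:
    """Hypothetical LLM call: returns a Thought plus either a Python
    'action' to execute or a 'final' expected output."""
    raise NotImplementedError("replace with a real model call")


def run_python(code: str) -> str:
    """Execute a code snippet and return whatever it prints (the Observation)."""
    buffer = io.StringIO()
    namespace: dict = {}
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # intentionally unsandboxed for brevity
    except Exception as exc:  # surface errors as observations too
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def compute_expected_output(function_source: str, test_input: str,
                            max_turns: int = 5) -> str:
    """Drive a Thought -> Action -> Observation chain until the agent
    commits to a final expected output for `test_input`."""
    conversation = [
        f"Function under test:\n{function_source}",
        f"Test input: {test_input}",
    ]
    for _ in range(max_turns):
        step = query_llm(conversation)
        conversation.append(f"Thought: {step['thought']}")
        if "final" in step:  # the agent is confident in its answer
            return step["final"]
        observation = run_python(step["action"])
        conversation.append(f"Observation: {observation}")
    raise RuntimeError("no final answer within the turn budget")
```

The key design choice this illustrates is that the expected output is obtained by actually executing code rather than by asking the model to compute it mentally, which is where the paper reports LLMs tend to fail on harder problems.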
While LLMs can generate a large number of correct test cases for relatively easy problems, they struggle with more complex ones. TestChain addresses this by decomposing test case generation into two sequential sub-tasks, test input generation and test output generation, with a dedicated agent for each (see the sketch below). This decomposition reduces the complexity of the input-output mapping the model must handle and improves the accuracy of the generated test cases. The paper also discusses the limitations of the framework, noting that it demands strong model capabilities and is tailored to robust models such as GPT-3.5 and GPT-4. Future work could explore ways to extend the TestChain paradigm to weaker models. Overall, the study demonstrates the potential of TestChain for generating test cases that are both accurate and reliable.
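Read that way, the two-agent pipeline could be sketched as an input-designer agent followed by an output-calculator agent, with the results assembled into assert-style test cases. The sketch below is illustrative only: `generate_test_inputs` is an assumed stand-in for the first agent's model call, and it reuses `compute_expected_output` from the previous sketch for the second agent.

```python
# Illustrative two-agent decomposition: a designer agent proposes test
# inputs, and each input is resolved to an expected output through the
# interpreter loop (compute_expected_output) sketched above.
def generate_test_inputs(problem_description: str, n: int = 5) -> list[str]:
    """Hypothetical designer agent: returns `n` candidate test inputs
    (as Python expressions) covering normal and edge cases."""
    raise NotImplementedError("replace with a real model call")


def build_test_cases(problem_description: str, function_source: str,
                     function_name: str) -> list[str]:
    """Assemble assert-style test cases by pairing generated inputs with
    interpreter-verified expected outputs."""
    test_cases = []
    for test_input in generate_test_inputs(problem_description):
        expected = compute_expected_output(function_source, test_input)
        test_cases.append(f"assert {function_name}({test_input}) == {expected}")
    return test_cases
```

Splitting the task this way means each agent solves a simpler problem: the first only has to propose diverse, valid inputs, while the correctness of the outputs is delegated to the interpreter rather than to the model's own reasoning.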