LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation

2 Oct 2024 | Sarah Fakhoury*, Aaditya Naik†, Georgios Sakkas‡, Saikat Chakraborty* and Shuvendu K. Lahiri*
The paper introduces TiCODER, a novel interactive workflow designed to improve the accuracy of code generation from large language models (LLMs). TiCODER aims to address the challenge of clarifying user intent through tests, which can then be used to prune and rank code suggestions. The workflow consists of two variants: TiCODER-PASSFAIL and TiCODER-OUTPUT, which differ in the type of user feedback required. A mixed-methods user study with 15 programmers evaluated the effectiveness of TiCODER, finding that participants using TiCODER were more likely to correctly evaluate AI-generated code and reported significantly less cognitive load. Additionally, a large-scale evaluation on two Python datasets using four state-of-the-art LLMs showed an average absolute improvement of 45.97% in pass@1 code generation accuracy within 5 user interactions. The study also demonstrated that TiCODER can boost the accuracy of smaller models to levels comparable to larger models like GPT-4-32k. The paper concludes by discussing the implications of these findings for the broader research community and the limitations of the experiments.
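The core idea of pruning and ranking candidates against user-validated tests can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the representation of user feedback as (input, expected output) pairs, in the style of TiCODER-OUTPUT, are hypothetical, and the real workflow queries an LLM to generate both the candidate code and the clarifying tests.

```python
# Hypothetical sketch of test-driven pruning/ranking of LLM code suggestions.
# Feedback is modeled as (input, expected_output) pairs the user approved,
# loosely mirroring the TiCODER-OUTPUT style of interaction.

def passes_test(candidate, test_input, expected_output):
    """Run one candidate on one approved test; any exception counts as failure."""
    try:
        return candidate(test_input) == expected_output
    except Exception:
        return False

def prune_and_rank(candidates, approved_tests):
    """Discard candidates that violate any approved test; rank the rest."""
    scored = []
    for cand in candidates:
        score = sum(passes_test(cand, i, o) for i, o in approved_tests)
        if score == len(approved_tests):  # prune: must satisfy every approved test
            scored.append((score, cand))
    scored.sort(key=lambda pair: -pair[0])
    return [cand for _, cand in scored]

# Toy example: two candidate implementations of "double a number",
# one correct and one buggy; the user approved the test 3 -> 6.
candidates = [lambda x: x * 2, lambda x: x + 2]
approved = [(3, 6)]
survivors = prune_and_rank(candidates, approved)
print(len(survivors))  # the buggy candidate is pruned
```

With each additional approved test, more incorrect candidates are eliminated, which is the mechanism behind the reported accuracy gains over successive user interactions.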