What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

8 Jul 2024 | Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
This paper investigates the limitations and performance of large language models (LLMs) in code generation. The study evaluates three closed-source and four open-source LLMs on three widely used benchmarks, revealing that LLMs struggle to generate correct code for complex problems, often producing shorter but more complex code than the canonical solutions.

A taxonomy of bugs in incorrect code is developed, categorizing them into three primary types (Syntax Bug, Runtime Bug, Functional Bug) and 12 secondary types. The analysis shows that functional bugs are the most common, while syntax bugs are the least common. Additionally, the study constructs a real-world benchmark (RWPB) with 140 code generation tasks drawn from GitHub repositories, highlighting differences in bug distributions between real-world scenarios and existing benchmarks.

A novel training-free iterative method is proposed, enabling LLMs to self-critique and correct their generated code based on bug types and compiler feedback. Experimental results show that this method can significantly reduce bugs and increase the passing rate by 29.2% after two iterations, indicating substantial potential for LLMs to handle more complex tasks. The study also finds that closed-source models outperform open-source models, particularly on complex tasks, and that LLMs often fail to generate optimal algorithms, leading to timeout errors. The findings suggest that improving LLMs' comprehension capabilities is crucial for enhancing code generation accuracy.
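To make the iterative self-critique idea concrete, below is a minimal Python sketch of what such a training-free repair loop might look like. It is not the authors' implementation: `llm_generate` is a hypothetical stand-in for a model call, and the bug-type detection is a deliberate simplification of the paper's taxonomy (syntax errors caught at compile time, uncaught exceptions treated as runtime bugs, failed assertions treated as functional bugs).

```python
import traceback


def classify_bug(code: str, tests: str):
    """Roughly map a generated solution onto the three primary bug types.

    Returns None if all tests pass, otherwise a (bug_type, feedback) pair.
    This detection logic is a simplification for illustration only.
    """
    try:
        compiled = compile(code, "<generated>", "exec")
    except SyntaxError as exc:
        return "Syntax Bug", f"SyntaxError: {exc}"
    namespace = {}
    try:
        exec(compiled, namespace)                          # define the solution
        exec(compile(tests, "<tests>", "exec"), namespace)  # run the unit tests
    except AssertionError as exc:
        return "Functional Bug", f"Failed test assertion: {exc}"
    except Exception:
        return "Runtime Bug", traceback.format_exc(limit=1)
    return None  # all tests passed


def iterative_self_repair(task: str, tests: str, llm_generate, max_iters: int = 2):
    """Training-free repair loop: generate, diagnose, and ask the model to
    critique and fix its own code using the bug type and interpreter feedback.

    `llm_generate` is a hypothetical callable (prompt -> code string).
    """
    prompt = task
    code = llm_generate(prompt)
    for _ in range(max_iters):
        result = classify_bug(code, tests)
        if result is None:
            return code  # passing solution found
        bug_type, feedback = result
        prompt = (
            f"{task}\n\nYour previous solution:\n{code}\n\n"
            f"It contains a {bug_type}. Interpreter feedback:\n{feedback}\n"
            "Critique the solution and return a corrected version."
        )
        code = llm_generate(prompt)
    return code  # best effort after max_iters rounds
```

In this sketch, the feedback string plays the role of the compiler/interpreter signal described above; the paper reports that two such iterations raise the passing rate by 29.2%.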