What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study

8 Jul 2024 | Shihan Dou1*, Haoxiang Jia3*, Shenxi Wu1, Huiyuan Zheng1, Weikang Zhou1, Muling Wu1, Mingxu Chai1, Jessica Fan5, Caishuang Huang1, Yunbo Tao1, Yan Liu1, Enyu Zhou1, Ming Zhang1, Yuhao Zhou1, Yueming Wu4, Rui Zheng1, Ming Wen2††, Rongxiang Weng6, Jingang Wang6, Xunliang Cai6, Tao Gui1††, Xipeng Qiu1, Qi Zhang1, Xuanjing Huang1
The paper "What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study" by Shihan Dou et al. explores the limitations and challenges of large language models (LLMs) in code generation. The study evaluates the performance of three closed-source LLMs and four open-source LLMs on three common benchmarks, focusing on the length, cyclomatic complexity, and API number of the generated code. Key findings include: 1. **Performance and Complexity**: LLMs struggle with more complex problems, producing shorter but more complicated code compared to canonical solutions. 2. **Bug Taxonomy**: A taxonomy of bugs is developed, categorizing them into three primary types (Syntax Bug, Runtime Bug, Functional Bug) and 12 sub-categories, with functional bugs being the most prevalent. 3. **Real-World Benchmark**: A real-world benchmark, *RWPR*, is created using 140 code generation tasks from GitHub repositories, highlighting differences in bug distributions between real-world scenarios and existing benchmarks. 4. **Bug Mitigation**: A novel training-free iterative method is proposed, enabling LLMs to self-critique and correct their generated code based on bug types and compiler feedback, significantly improving the passing rate by 29.2% after two iterations. The study emphasizes the need for more comprehensive empirical evaluations and suggests that closed-source models generally outperform open-source models in handling complex tasks, particularly in reducing syntax and runtime bugs. The findings provide valuable insights for improving LLMs' capabilities in code generation and real-world applications.The paper "What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study" by Shihan Dou et al. explores the limitations and challenges of large language models (LLMs) in code generation. The study evaluates the performance of three closed-source LLMs and four open-source LLMs on three common benchmarks, focusing on the length, cyclomatic complexity, and API number of the generated code. Key findings include: 1. **Performance and Complexity**: LLMs struggle with more complex problems, producing shorter but more complicated code compared to canonical solutions. 2. **Bug Taxonomy**: A taxonomy of bugs is developed, categorizing them into three primary types (Syntax Bug, Runtime Bug, Functional Bug) and 12 sub-categories, with functional bugs being the most prevalent. 3. **Real-World Benchmark**: A real-world benchmark, *RWPR*, is created using 140 code generation tasks from GitHub repositories, highlighting differences in bug distributions between real-world scenarios and existing benchmarks. 4. **Bug Mitigation**: A novel training-free iterative method is proposed, enabling LLMs to self-critique and correct their generated code based on bug types and compiler feedback, significantly improving the passing rate by 29.2% after two iterations. The study emphasizes the need for more comprehensive empirical evaluations and suggests that closed-source models generally outperform open-source models in handling complex tasks, particularly in reducing syntax and runtime bugs. The findings provide valuable insights for improving LLMs' capabilities in code generation and real-world applications.
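
The mitigation method is described as training-free and iterative: the model's output is executed, the resulting bug type and compiler/interpreter feedback are folded back into the prompt, and the model is asked to critique and regenerate. The summary does not give the exact prompts or loop structure, so the sketch below is an assumption-laden illustration: the `generate` callable, the prompt wording, and the default `max_iters` are all placeholders, with any LLM client standing in for the callable.

```python
from typing import Callable

def iterative_repair(
    task: str,
    tests: str,
    generate: Callable[[str], str],  # any LLM completion function (assumed interface)
    max_iters: int = 2,              # the paper reports a 29.2% gain after two iterations
) -> str:
    """Training-free self-critique loop: run the code, feed the
    failure category and raw error text back to the model, repeat."""
    prompt = task
    code = generate(prompt)
    for _ in range(max_iters):
        try:
            ns: dict = {}
            exec(compile(code, "<gen>", "exec"), ns)
            exec(tests, ns)
            return code  # all tests pass; stop early
        except Exception as err:
            bug_type = (
                "syntax" if isinstance(err, SyntaxError)
                else "functional" if isinstance(err, AssertionError)
                else "runtime"
            )
            # Fold the bug type and interpreter feedback into a
            # self-critique prompt (wording is illustrative only).
            prompt = (
                f"{task}\n\nYour previous solution:\n{code}\n\n"
                f"It failed with a {bug_type} bug:\n{err!r}\n"
                "Critique the code, then output a corrected version."
            )
            code = generate(prompt)
    return code
```

Because the loop keys the critique prompt to the bug category, the model receives different guidance for a parse failure than for a wrong answer, which is the core idea behind conditioning self-repair on the taxonomy rather than on raw error text alone.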