The paper "What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study" by Shihan Dou et al. explores the limitations and challenges of large language models (LLMs) in code generation. The study evaluates the performance of three closed-source LLMs and four open-source LLMs on three common benchmarks, focusing on the length, cyclomatic complexity, and API number of the generated code. Key findings include:
1. **Performance and Complexity**: LLMs struggle with more complex problems, producing code that is shorter yet more complex (higher cyclomatic complexity) than the canonical solutions; a rough sketch of these metrics follows this list.
2. **Bug Taxonomy**: A taxonomy of bugs is developed, categorizing them into three primary types (Syntax Bug, Runtime Bug, Functional Bug) and 12 sub-categories, with functional bugs being the most prevalent.
3. **Real-World Benchmark**: A real-world benchmark, *RWPR*, is created using 140 code generation tasks from GitHub repositories, highlighting differences in bug distributions between real-world scenarios and existing benchmarks.
4. **Bug Mitigation**: A novel training-free iterative method is proposed, enabling LLMs to self-critique and correct their generated code based on bug types and compiler feedback, improving the passing rate by 29.2% after two iterations (an illustrative sketch of such a loop appears after this list).
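The summary does not spell out how the three code metrics are computed; as a rough, illustrative approximation (not the authors' measurement pipeline), they can be estimated in Python with the standard-library `ast` module. The branch-counting complexity estimate and the API-call count below are simplifications:

```python
import ast

def code_metrics(source: str) -> dict:
    """Roughly estimate length, cyclomatic complexity, and API usage of a snippet."""
    tree = ast.parse(source)

    # Length: count non-empty source lines.
    length = sum(1 for line in source.splitlines() if line.strip())

    # Cyclomatic complexity ~ 1 + number of decision points (a common approximation).
    decision_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                      ast.BoolOp, ast.IfExp, ast.comprehension)
    complexity = 1 + sum(isinstance(node, decision_nodes) for node in ast.walk(tree))

    # API usage: distinct function / method names that appear as call targets.
    apis = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                apis.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                apis.add(node.func.attr)

    return {"length": length, "cyclomatic_complexity": complexity, "api_count": len(apis)}

print(code_metrics("def f(xs):\n    return sorted(x for x in xs if x > 0)"))
```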
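Likewise, the self-critique loop is only described at a high level here, so the sketch below is a hedged reconstruction rather than the authors' implementation. It assumes the generated code defines a function named `solve`, that a test case is a dict with `input` and `expected` keys, and that `generate_fix` stands in for a hypothetical LLM call; the bug classification is reduced to the three primary categories of the taxonomy above:

```python
import traceback

def classify_bug(source: str, test_case: dict) -> tuple[str, str]:
    """Map a candidate solution to one of the three primary bug types (or 'Pass')."""
    try:
        compile(source, "<generated>", "exec")            # Syntax Bug: code does not parse.
    except SyntaxError:
        return "Syntax Bug", traceback.format_exc()
    namespace: dict = {}
    try:
        exec(source, namespace)
        result = namespace["solve"](*test_case["input"])  # Runtime Bug: crashes while executing.
    except Exception:
        return "Runtime Bug", traceback.format_exc()
    if result != test_case["expected"]:                   # Functional Bug: runs but gives wrong output.
        return "Functional Bug", f"expected {test_case['expected']!r}, got {result!r}"
    return "Pass", ""

def self_repair(task: str, source: str, test_case: dict, generate_fix, max_iters: int = 2) -> str:
    """Training-free iterative repair: feed the bug type and feedback back to the model."""
    for _ in range(max_iters):
        bug_type, feedback = classify_bug(source, test_case)
        if bug_type == "Pass":
            break
        critique_prompt = (
            f"Task: {task}\n"
            f"Your previous code:\n{source}\n"
            f"It contains a {bug_type}. Feedback:\n{feedback}\n"
            f"Criticize your code and output a corrected version."
        )
        source = generate_fix(critique_prompt)  # hypothetical LLM call, not a real API
    return source
```

The exact prompts and iteration budget are placeholders; in the paper the feedback is described as coming from the bug type and compiler output together with the model's own critique.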
The study emphasizes the need for more comprehensive empirical evaluations and suggests that closed-source models generally outperform open-source models in handling complex tasks, particularly in reducing syntax and runtime bugs. The findings provide valuable insights for improving LLMs' capabilities in code generation and real-world applications.