Program Code Generation with Generative AIs

2024 | Baskhad Idrisov and Tim Schlippe
This study compares the correctness, efficiency, and maintainability of human-generated and AI-generated program code. Six LeetCode problems of varying difficulty were selected, yielding 18 program codes from each of seven generative AIs. GitHub Copilot (Codex, GPT-3.0) performed best, solving 9 of the 18 problems (50.0%), while CodeWhisperer solved none. BingAI Chat (GPT-4.0) solved 7 (38.9%), ChatGPT and Code Llama each solved 4 (22.2%), and StarCoder and InstructCodeT5+ each solved 1 (5.6%). Surprisingly, ChatGPT was the only AI that solved a hard-level problem. Overall, 26 AI-generated codes (20.6%) solved their respective problems. For 11 incorrect AI-generated codes (8.7%), only minimal modifications were needed to solve the problem, saving 8.9% to 71.3% of the time compared to writing the code from scratch.

The AI-generated code was evaluated with metrics such as lines of code, cyclomatic complexity, Halstead complexity, and maintainability index, and compared to human-generated code in terms of correctness, efficiency, and maintainability. The results show that AI-generated code often had fewer lines of code, higher cyclomatic complexity, and similar or worse time and space complexity than human-generated code, although some AI-generated code was more maintainable. For example, GitHub Copilot performed best on the Java and C++ tasks, while BingAI Chat performed best in Python.

The study also analyzed whether incorrect AI-generated code can be corrected with minimal modifications. For 11 of the 24 potentially correct AI-generated codes, the time to correct (TTC) was less than the time to write correct code from scratch, saving up to 71.3% of the time. The study concludes that while AI-generated code is not always correct, efficient, or maintainable, in many cases it can be corrected with minimal effort.
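The summary above does not give the exact formulas the authors used, but the metrics it names are well established. The sketch below, in Python, illustrates two of them under stated assumptions: the classic maintainability index formula by Oman and Hagemeister (rescaled to 0-100 as in Visual Studio's variant), and the percentage of time saved when correcting AI-generated code instead of rewriting it. All input values are hypothetical, chosen only to illustrate the calculation.

```python
import math

def maintainability_index(halstead_volume, cyclomatic_complexity, loc):
    """Classic maintainability index (Oman & Hagemeister), rescaled to 0-100.

    MI = 171 - 5.2*ln(V) - 0.23*CC - 16.2*ln(LOC), then scaled by 100/171.
    Higher values indicate more maintainable code.
    """
    mi = (171
          - 5.2 * math.log(halstead_volume)
          - 0.23 * cyclomatic_complexity
          - 16.2 * math.log(loc))
    return max(0.0, mi * 100 / 171)

def time_saving_percent(time_to_correct, time_from_scratch):
    """Percentage of time saved by correcting code vs. writing it from scratch."""
    return 100.0 * (1 - time_to_correct / time_from_scratch)

# Hypothetical inputs: Halstead volume 100, cyclomatic complexity 5, 20 LOC.
print(round(maintainability_index(100.0, 5, 20), 1))
# Hypothetical timings: correcting takes 28.7 min vs. 100 min from scratch.
print(round(time_saving_percent(28.7, 100.0), 1))
```

With these inputs, the second call yields a saving of 71.3%, matching the upper bound the study reports for its best case.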
Future work includes expanding the dataset to other programming languages and exploring optimal prompting strategies for each AI. The study also suggests that integrating AI into software development workflows could improve efficiency and productivity.