Program Code Generation with Generative AIs

2024 | Baskhad Idrisov and Tim Schlippe
This study compares the correctness, efficiency, and maintainability of human-generated and AI-generated program code. Six LeetCode problems of varying difficulty were selected, yielding 18 program codes from each of seven generative AIs. GitHub Copilot (Codex, GPT-3.0) performed best, solving 9 of the 18 problems (50.0%), while CodeWhisperer solved none. BingAI Chat (GPT-4.0) solved 7 (38.9%), ChatGPT and Code Llama each solved 4 (22.2%), and StarCoder and InstructCodeT5+ each solved 1 (5.6%). Surprisingly, ChatGPT was the only AI that solved a hard-level problem. Overall, 26 AI-generated codes (20.6%) solved their respective problems. For 11 incorrect AI-generated codes (8.7%), only minimal modifications were needed to solve the problem, saving 8.9% to 71.3% of the time compared to writing the code from scratch.

The AI-generated code was evaluated with metrics such as lines of code, cyclomatic complexity, Halstead complexity, and maintainability index, and compared to human-generated code in terms of correctness, efficiency, and maintainability. The results show that AI-generated code often had fewer lines of code, higher cyclomatic complexity, and similar or worse time and space complexity than human-generated code, although some AI-generated code was more maintainable. For example, GitHub Copilot performed best on the Java and C++ tasks, while BingAI Chat performed best in Python.

The study also analyzed whether incorrect AI-generated code can be corrected with minimal modifications. For 11 of the 24 potentially correct AI-generated codes, the time to correct (TTC) was less than the time to write correct code from scratch, saving up to 71.3% of the time. The study concludes that while AI-generated code is not always correct, efficient, or maintainable, in many cases it can be corrected with minimal effort.
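The summary above does not give the exact formulas the authors used, but the metrics it names are well established. The sketch below, in Python, illustrates two of them under stated assumptions: the classic maintainability index formula by Oman and Hagemeister (rescaled to 0-100 as in Visual Studio's variant), and the percentage of time saved when correcting AI-generated code instead of rewriting it. All input values are hypothetical, chosen only to illustrate the calculation.

```python
import math

def maintainability_index(halstead_volume, cyclomatic_complexity, loc):
    """Classic maintainability index (Oman & Hagemeister), rescaled to 0-100.

    MI = 171 - 5.2*ln(V) - 0.23*CC - 16.2*ln(LOC), then scaled by 100/171.
    Higher values indicate more maintainable code.
    """
    mi = (171
          - 5.2 * math.log(halstead_volume)
          - 0.23 * cyclomatic_complexity
          - 16.2 * math.log(loc))
    return max(0.0, mi * 100 / 171)

def time_saving_percent(time_to_correct, time_from_scratch):
    """Percentage of time saved by correcting code vs. writing it from scratch."""
    return 100.0 * (1 - time_to_correct / time_from_scratch)

# Hypothetical inputs: Halstead volume 100, cyclomatic complexity 5, 20 LOC.
print(round(maintainability_index(100.0, 5, 20), 1))
# Hypothetical timings: correcting takes 28.7 min vs. 100 min from scratch.
print(round(time_saving_percent(28.7, 100.0), 1))
```

With these inputs, the second call yields a saving of 71.3%, matching the upper bound the study reports for its best case.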
Future work includes expanding the dataset to other programming languages and exploring optimal prompting strategies for each AI. The study also suggests that integrating AI into software development workflows could improve efficiency and productivity.