2024 | Wei Hung Pan, Ming Jie Chok, Jonathan Leong Shan Wong, Yung Xin Shin, Yeong Shian Poon, Zhou Yang, Chun Yong Chong, David Lo, Mei Kuan Lim
This paper presents an empirical study of AI-generated content (AIGC) detectors applied to AI-generated code, with implications for education. The study evaluates five AIGC detectors on 5,069 samples of coding problems paired with human-written and AI-generated solutions. The dataset was collected from several sources, including Kaggle, LeetCode, and Quescol, and was transformed into 13 variants to test the detectors' ability to distinguish human-written from AI-generated code. The results show that existing AIGC detectors perform poorly on code, with accuracy hovering around 0.5 in most cases, i.e., little better than random guessing, which marks a significant limitation of current detection tools in programming education. The detectors are also sensitive to code variants, further undermining their reliability. The authors therefore caution educators against relying solely on AIGC detectors for academic integrity assessments, emphasize the need for more specialized tools and algorithms for detecting AI-generated code in educational settings, and call for further research to address the limitations of current detectors and safeguard the integrity of academic assessments.
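To make the reported evaluation concrete, below is a minimal sketch of the kind of measurement the study performs: run a detector over labeled code samples and compute accuracy. The `detect_ai_generated` function is a hypothetical stand-in for any of the five detectors, and the two samples are illustrative placeholders, not items from the paper's dataset.

```python
# Hedged sketch: evaluating an AIGC detector's accuracy on labeled code samples.
# `detect_ai_generated` and the samples below are hypothetical, for illustration only.

def detect_ai_generated(code: str) -> bool:
    """Hypothetical detector: returns True if it judges the code AI-generated."""
    # Placeholder heuristic for the sketch; real detectors use learned models.
    return "def solve" in code

samples = [
    ("def add(a, b):\n    return a + b", False),            # labeled human-written
    ("def solve(nums):\n    return sorted(nums)[0]", True),  # labeled AI-generated
]

correct = sum(detect_ai_generated(code) == is_ai for code, is_ai in samples)
accuracy = correct / len(samples)
print(f"accuracy = {accuracy:.2f}")
```

On a balanced set of human-written and AI-generated samples, an accuracy near 0.5, as the paper reports for most detectors and variants, means the detector is performing at roughly chance level.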