Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education

2024 | Wei Hung Pan, Ming Jie Chok, Jonathan Leong Shan Wong, Yung Xin Shin, Yeong Shian Poon, Zhou Yang, Chun Yong Chong, David Lo, Mei Kuan Lim
This paper presents an empirical study evaluating the performance of five AI-generated content (AIGC) detectors in identifying AI-generated code. The study addresses a growing concern among educators: the use of Large Language Models (LLMs) such as ChatGPT in programming education and the associated risk of academic misconduct. The researchers collected a dataset of 5,069 samples, comprising coding problems and human-written Python solutions, from sources including Quescol, Kaggle, and LeetCode. They then created 13 sets of code-problem variant prompts to instruct ChatGPT to generate outputs, and assessed the performance of the five AIGC detectors on both the human-written and AI-generated code.

The results show that existing AIGC detectors perform poorly at distinguishing human-written code from AI-generated code, exhibiting low accuracy, high false positive rates, and significant sensitivity to specific code variants. For example, the GLTR detector yields a p-value of 0.0296, indicating statistically significant performance differences across variants. The study also highlights the limitations of individual detectors, such as GPTZero's struggle with the syntax and structure of programming languages, Sapling's focus on text-based content, and GLTR's sensitivity to specific patterns.

The paper concludes by emphasizing the need for further research and development to enhance the efficacy of AIGC detectors, particularly for code-based content. It recommends that educators define clear objectives, ensure ethical use, and stay current with technological advancements in order to detect and evaluate AIGC in educational materials effectively. The findings underscore the importance of addressing the limitations of current detectors to maintain academic integrity and ensure fair grading in programming education.
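To make the reported metrics concrete, below is a minimal Python sketch of how a detector's accuracy, false positive rate, and cross-variant sensitivity might be computed. The function names, the 0.5 decision threshold, and the score format are illustrative assumptions; the paper does not publish its evaluation code.

from scipy.stats import kruskal

def detector_metrics(human_scores, ai_scores, threshold=0.5):
    """Compute accuracy and false positive rate for a binary AIGC detector.

    Assumption: each detector emits a score in [0, 1], and a score above
    `threshold` is treated as a prediction of 'AI-generated'. Human-written
    samples flagged as AI-generated count as false positives.
    """
    tp = sum(s > threshold for s in ai_scores)      # AI code correctly flagged
    fn = len(ai_scores) - tp                        # AI code missed
    fp = sum(s > threshold for s in human_scores)   # human code wrongly flagged
    tn = len(human_scores) - fp                     # human code correctly passed
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_positive_rate

def variant_sensitivity(scores_by_variant):
    """Kruskal-Wallis H-test over detector scores grouped by prompt variant.

    A small p-value (e.g., the 0.0296 reported for GLTR) suggests the
    detector's behavior differs significantly across code variants.
    """
    _, p_value = kruskal(*scores_by_variant.values())
    return p_value

For instance, passing one detector's scores grouped by the 13 prompt variants into variant_sensitivity would yield the kind of per-detector significance comparison the study reports, while detector_metrics captures the accuracy and false-positive behavior discussed above.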