July 8–10, 2024, Milan, Italy | Imen Azaiz, Natalie Kiesler, Sven Strickroth
This paper explores the quality of feedback generated by GPT-4 Turbo for programming exercises, focusing on two assignments from an introductory programming course. The study uses a qualitative thematic analysis to evaluate the feedback based on correctness, personalization, fault localization, and other features. Compared to previous versions of LLMs (e.g., GPT-3.5), GPT-4 Turbo shows significant improvements, including more structured and consistent output, accurate identification of invalid casing, and the inclusion of student program outputs in some cases. However, the feedback also exhibits inconsistencies, such as incorrect error classifications and redundant suggestions. The research highlights the potential and limitations of LLMs in e-assessment systems and pedagogical scenarios, emphasizing the need for further research to integrate LLMs effectively into educational contexts. The findings suggest that while GPT-4 Turbo provides valuable feedback, it may not be suitable for unguided use by students or for broad implementation without careful consideration of its limitations and potential biases.