Feedback-Generation for Programming Exercises With GPT-4


July 8-10, 2024 | Imen Azaiz, Natalie Kiesler, Sven Strickroth
This paper examines the quality of feedback generated by GPT-4 Turbo for programming exercises, focusing on its ability to provide formative feedback based on task descriptions and student submissions. Two assignments from an introductory programming course were selected, and GPT-4 Turbo was asked to generate feedback for 55 randomly chosen student submissions. The output was analyzed qualitatively with respect to correctness, personalization, fault localization, and other features.

Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements: its feedback is more detailed, structured, and consistent, achieves a higher accuracy rate, and can correctly identify invalid cases in a student program's output. The feedback frequently includes code examples, explanations, and suggestions for improvement, and it is personalized to the student's specific submission. Nevertheless, problems remain: misleading feedback, incorrect explanations, redundancies, and internal inconsistencies, such as stating that a submission is correct while also saying an error needs to be fixed. The feedback can also be too complex for novices.

The study highlights the potential of GPT-4 Turbo for use in formative assessment systems, but warns against its use without guidance or prior instruction. The research contributes to the understanding of LLMs' potential and limitations, how to integrate them into assessment systems and pedagogical scenarios, and the implications for educators and students using GPT-4-based applications.
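The paper's exact prompt and parameters are not reproduced in this summary, but a minimal sketch of the setup it describes, sending a task description and a student submission to GPT-4 Turbo and requesting formative feedback, might look like the following, using the OpenAI Python client. The prompt wording, model identifier, and temperature here are illustrative assumptions, not the authors' configuration.

```python
# Illustrative sketch (not the authors' exact setup): requesting formative
# feedback on one student submission from GPT-4 Turbo via the OpenAI client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_feedback(task_description: str, student_code: str) -> str:
    """Ask GPT-4 Turbo for formative feedback on a student's submission."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model identifier
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a tutor in an introductory programming course. "
                    "Give formative feedback on the student's submission: "
                    "point out errors, explain them, and suggest "
                    "improvements, but do not write the full solution."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Task description:\n{task_description}\n\n"
                    f"Student submission:\n{student_code}"
                ),
            },
        ],
        temperature=0.2,  # assumed; lower values reduce output variance
    )
    return response.choices[0].message.content
```

In a study like this, such a call would be repeated once per sampled submission, and the returned feedback texts would then be coded manually along dimensions such as correctness, personalization, and fault localization.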