This paper evaluates the effectiveness of large language models (LLMs) in automatic short answer grading (ASAG) using two LLM-based chatbots, GPT-3.5 and GPT-4, on a dataset of 2000 Finnish student answers from ten undergraduate courses. The study aims to assess whether these models can be reliable tools for educators in grading short answers. The evaluation is conducted under zero-shot and one-shot settings, focusing on multiple perspectives: grading system developers, teachers, and students.
Key findings include:
- GPT-4 achieves a good quadratic weighted kappa (QWK) score (0.6 or higher) in 44% of one-shot settings, outperforming GPT-3.5, which does so in 21% (a QWK computation is sketched after this list).
- There is a negative association between student answer length and model performance.
- A smaller standard deviation among predictions is correlated with lower performance.
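The QWK metric referenced above is Cohen's kappa with quadratic weights, which penalizes larger disagreements between predicted and human-assigned grades more heavily. Below is a minimal illustrative sketch of how such a score could be computed; the grade lists and variable names are hypothetical and not taken from the paper's data.

```python
# Minimal sketch: QWK between hypothetical teacher grades and LLM-predicted grades.
from sklearn.metrics import cohen_kappa_score

teacher_grades = [0, 1, 2, 2, 3, 1, 0, 3]   # hypothetical human-assigned grades
model_grades   = [0, 1, 2, 3, 3, 1, 1, 2]   # hypothetical LLM-predicted grades

# Quadratic weighting makes large grade disagreements count more than small ones,
# which is why QWK is a common agreement metric for ordinal grading scales.
qwk = cohen_kappa_score(teacher_grades, model_grades, weights="quadratic")
print(f"QWK: {qwk:.2f}")  # a value of 0.6+ is commonly read as good agreement
```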
The study concludes that while GPT-4 shows promise as a capable grader, further research is necessary before it can be deployed as a reliable autograder. The paper also discusses the limitations of the study, such as the lack of grading criteria and reference answers, and suggests areas for future research, including prompt engineering, enhancing question clarity, and providing comprehensive information to the models.