Automatic Short Answer Grading for Finnish with ChatGPT

2024 | Li-Hsin Chang, Filip Ginter
This paper evaluates the performance of two large language models (GPT-3.5 and GPT-4) in automatic short answer grading (ASAG) for Finnish student responses. The study uses 2000 student answers from ten undergraduate courses, focusing on short-answer questions. The models are tested in zero-shot and one-shot settings, with GPT-4 outperforming GPT-3.5 in 44% of one-shot cases, achieving a QWK score of 0.6+ in those instances. However, the results show a negative correlation between answer length and model performance, as well as a positive correlation between prediction standard deviation and lower performance. The study concludes that while GPT-4 shows promise as a grader, further research is needed before deploying it as a reliable autograder. The study explores the suitability of LLMs for ASAG, testing ChatGPT based on GPT-3.5 and GPT-4 on 2000 Finnish short-answer questions from ten bachelor-level courses. The experimental design aims to determine if direct use of ChatGPT in summative assessments by educators is feasible. The results show that while one-shot GPT-4 achieves good QWK scores on 44 of 100 tested questions, further research is needed before deploying an LLM-based short-answer grader. The study also examines the performance of GPT-3.5 and GPT-4 in various metrics, including QWK, TAA, and RMC. GPT-4 outperforms GPT-3.5 in most metrics, particularly in one-shot settings. However, the results show that the models tend to assign higher scores than human evaluators, and there is a negative correlation between answer length and model performance. The study also finds that the models perform better on shorter answers and questions with a narrower scope. The study highlights the challenges of using LLMs for ASAG, including the need for more research, the limitations of current models, and the potential for bias in grading. 
The study also discusses the implications of using LLMs for ASAG, including the potential for overestimation of model accuracy by students and the need for human intervention in certain cases. The study concludes that while LLMs show promise for ASAG, further research is needed to ensure their reliability and effectiveness.This paper evaluates the performance of two large language models (GPT-3.5 and GPT-4) in automatic short answer grading (ASAG) for Finnish student responses. The study uses 2000 student answers from ten undergraduate courses, focusing on short-answer questions. The models are tested in zero-shot and one-shot settings, with GPT-4 outperforming GPT-3.5 in 44% of one-shot cases, achieving a QWK score of 0.6+ in those instances. However, the results show a negative correlation between answer length and model performance, as well as a positive correlation between prediction standard deviation and lower performance. The study concludes that while GPT-4 shows promise as a grader, further research is needed before deploying it as a reliable autograder. The study explores the suitability of LLMs for ASAG, testing ChatGPT based on GPT-3.5 and GPT-4 on 2000 Finnish short-answer questions from ten bachelor-level courses. The experimental design aims to determine if direct use of ChatGPT in summative assessments by educators is feasible. The results show that while one-shot GPT-4 achieves good QWK scores on 44 of 100 tested questions, further research is needed before deploying an LLM-based short-answer grader. The study also examines the performance of GPT-3.5 and GPT-4 in various metrics, including QWK, TAA, and RMC. GPT-4 outperforms GPT-3.5 in most metrics, particularly in one-shot settings. However, the results show that the models tend to assign higher scores than human evaluators, and there is a negative correlation between answer length and model performance. 
The study also finds that the models perform better on shorter answers and questions with a narrower scope. The study highlights the challenges of using LLMs for ASAG, including the need for more research, the limitations of current models, and the potential for bias in grading. The study also discusses the implications of using LLMs for ASAG, including the potential for overestimation of model accuracy by students and the need for human intervention in certain cases. The study concludes that while LLMs show promise for ASAG, further research is needed to ensure their reliability and effectiveness.
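The headline metric in the paper, quadratic weighted kappa (QWK), measures agreement between two raters while penalizing large disagreements more heavily than small ones. The following is a minimal NumPy sketch of the standard QWK formula, not the authors' evaluation code; the 0-3 grading scale in the example is illustrative:

```python
import numpy as np

def quadratic_weighted_kappa(human, model, num_classes):
    """QWK between two integer rating sequences with values in [0, num_classes)."""
    human = np.asarray(human)
    model = np.asarray(model)
    # Observed joint distribution of (human, model) ratings
    observed = np.zeros((num_classes, num_classes))
    for h, m in zip(human, model):
        observed[h, m] += 1
    observed /= observed.sum()
    # Expected joint distribution under independence (outer product of marginals)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: zero on the diagonal, growing with distance
    i, j = np.indices((num_classes, num_classes))
    weights = (i - j) ** 2 / (num_classes - 1) ** 2
    # Kappa = 1 - (weighted observed disagreement / weighted expected disagreement)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Illustrative grades on a 0-3 scale; perfect agreement yields QWK = 1.0
human_scores = [0, 1, 2, 3, 2, 1]
model_scores = [0, 1, 2, 3, 2, 1]
print(quadratic_weighted_kappa(human_scores, model_scores, 4))  # 1.0
```

A QWK of 1.0 means perfect agreement, 0 means chance-level agreement, and negative values mean worse than chance; the paper's threshold of 0.6 is a common rule of thumb for "substantial" agreement.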