Can Large Language Models Automatically Score Proficiency of Written Essays?

16 Apr 2024 | Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed
Abstract: Although several methods have been proposed to address the problem of automated essay scoring (AES) over the last 50 years, there is still much to be desired in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check whether these models can perform this task and, if so, how their performance is positioned among state-of-the-art (SOTA) models at two levels: holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring out their maximum potential on this task. Our experiments on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and the nature of the task. Second, the two LLMs exhibited comparable average performance on AES, with a slight edge for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

Keywords: ChatGPT, Llama, Automated Essay Scoring, Natural Language Processing

This paper investigates whether large language models (LLMs) can automatically score the proficiency of written essays. We tested two LLMs, ChatGPT and Llama, on the ASAP dataset, which contains 8 tasks and 12,978 essays. We designed four different prompts to evaluate their performance and found that the performance of LLMs is highly dependent on the prompt and the task type, which highlights the importance of prompt engineering for the AES task. ChatGPT showed slightly better performance than Llama, but both models were far behind state-of-the-art (SOTA) models in terms of scoring predictions. However, both models provided feedback that could help improve the quality of the essays. Our results suggest that while LLMs may not yet be reliable for predicting essay scores, they have the potential to provide meaningful feedback for improving writing quality. We plan to further explore the use of LLMs for AES in future work.
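The exact prompts and model configurations used in the study are not reproduced here, but the general setup can be illustrated with a minimal sketch of prompting ChatGPT for a holistic essay score through the OpenAI Python SDK. The model name, score range, prompt wording, and the score_essay helper below are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch: ask an LLM for a single holistic score for an essay.
# Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

def score_essay(essay: str, min_score: int = 0, max_score: int = 10) -> int | None:
    """Prompt the model for a holistic score within [min_score, max_score]."""
    system_msg = (
        "You are an experienced English teacher. Score the following essay "
        f"holistically on a scale from {min_score} to {max_score}. "
        "Reply with the numeric score only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",       # illustrative model choice
        temperature=0,               # keep scoring as deterministic as possible
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": essay},
        ],
    )
    reply = response.choices[0].message.content or ""
    match = re.search(r"\d+", reply)  # extract the first integer in the reply
    return int(match.group()) if match else None

if __name__ == "__main__":
    print(score_essay("Dear local newspaper, I think computers have a positive effect on society..."))
```

A similar prompt could instead request per-trait scores (e.g., organization or word choice) or ask the model to append feedback explaining its score, which is the use case where the paper finds LLMs most promising.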