This paper investigates the ability of Large Language Models (LLMs), specifically ChatGPT and Llama, to automatically score written essays, a task known as automated essay scoring (AES). The authors explore whether these models can effectively analyze and score essays, comparing their performance to state-of-the-art (SOTA) models. They design four different prompts to enhance the models' performance and evaluate their effectiveness on the ASAP dataset, which includes 8 tasks and 12,978 essays.
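As a rough illustration of this setup, the sketch below shows one way an essay-scoring prompt could be sent to an LLM API. The prompt wording, model name, and score range here are assumptions made for illustration only; they are not the four prompts designed by the authors.

```python
# Hypothetical sketch (not the paper's actual prompts): asking an LLM to
# assign a holistic score to a single essay via the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_essay(essay_text: str, min_score: int = 1, max_score: int = 6) -> str:
    """Request a holistic score on a fixed numeric scale (range is illustrative)."""
    prompt = (
        f"You are an experienced essay grader. Read the essay below and "
        f"assign a holistic score between {min_score} and {max_score}. "
        f"Reply with the score only.\n\nEssay:\n{essay_text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        temperature=0,          # reduce run-to-run variance in the assigned score
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


print(score_essay("Dear local newspaper, I think computers have a positive effect..."))
```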
Key findings include:
1. **Prompt Engineering**: The choice of prompt significantly impacts the models' performance, with ChatGPT showing better consistency across different prompts compared to Llama.
2. **Performance**: Both LLMs exhibit comparable average performance in AES, with ChatGPT slightly outperforming Llama. However, their performance lags behind SOTA models.
3. **Consistency**: Llama shows significant inconsistency in scoring across different prompts, while ChatGPT demonstrates higher agreement and consistency (see the sketch after this list for one way such cross-prompt agreement can be quantified).
4. **Feedback**: Despite their scoring limitations, both LLMs provide valuable feedback to improve essay quality, which is particularly useful for teachers and students.
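The following is a minimal sketch of how consistency between two prompt variants might be quantified. Quadratic weighted kappa is a standard agreement metric in AES, but the summary does not state the paper's exact metric, so both the metric and the toy scores below are illustrative assumptions.

```python
# Hypothetical sketch: measuring how consistently a model scores the same
# essays under two different prompt wordings, using quadratic weighted kappa.
from sklearn.metrics import cohen_kappa_score

scores_prompt_a = [4, 3, 5, 2, 4, 3]  # toy scores from prompt variant A
scores_prompt_b = [4, 2, 5, 3, 4, 4]  # toy scores from prompt variant B

qwk = cohen_kappa_score(scores_prompt_a, scores_prompt_b, weights="quadratic")
print(f"Cross-prompt agreement (QWK): {qwk:.3f}")  # values near 1 indicate high consistency
```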
The study highlights the importance of prompt engineering and the need for further research to improve LLMs' performance in automated essay scoring. The authors also plan to explore other LLMs and expand their analysis to larger datasets.