19 May 2024 | Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
This paper explores the capabilities of prompted large language models (LLMs) in educational and assessment applications. The study investigates the effectiveness of prompt-based techniques for generating open-ended questions from school-level and undergraduate-level textbooks, as well as the feasibility of a chain-of-thought-inspired multi-stage prompting approach for language-agnostic multiple-choice question (MCQ) generation. Additionally, the study evaluates the ability of prompted LLMs to explain Bengali grammatical errors and to assess human resource (HR) spoken interview transcripts. The research compares the performance of LLMs with that of human experts across these educational tasks and domains, aiming to highlight the potential and limitations of LLMs in reshaping educational practices.
The study addresses five key research questions:
(1) How effective are prompt-based techniques at generating open-ended questions from school-level textbooks compared to human experts?
(2) How effective are prompt-based techniques at generating open-ended questions from undergraduate-level technical textbooks compared to human experts?
(3) Can a chain-of-thought-inspired multi-stage prompting approach be developed to generate language-agnostic MCQs using GPT-based models? (A minimal sketch of such a pipeline appears after this list.)
(4) To what extent are pre-trained LLMs capable of explaining Bengali grammatical errors compared to human experts?
(5) How ready are pre-trained LLMs to assess HR spoken interview transcripts compared to human experts?
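To make the multi-stage idea concrete, here is a minimal Python sketch of a chain-of-thought-inspired MCQ pipeline: first draft a question from a passage, then derive the answer key, then produce distractors. This is an illustrative assumption, not the paper's implementation; the staging, the prompt wordings, and the use of the Hugging Face google/flan-t5-base checkpoint (the paper uses GPT-based models for this task) are all choices made here for demonstration.

```python
# Minimal sketch of a chain-of-thought-inspired multi-stage prompting
# pipeline for MCQ generation. The staging, prompts, and choice of
# google/flan-t5-base are illustrative assumptions, not the paper's setup.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """One LLM call; each pipeline stage is a single such call."""
    return generator(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]

def generate_mcq(passage: str) -> dict:
    # Stage 1: draft a question grounded in the passage.
    question = generate(f"Generate a question from this passage:\n{passage}")
    # Stage 2: answer the question from the same passage (the answer key).
    answer = generate(f"Passage: {passage}\nQuestion: {question}\nAnswer:")
    # Stage 3: produce plausible-but-incorrect options (distractors).
    distractors = generate(
        f"Question: {question}\nCorrect answer: {answer}\n"
        "List three plausible but incorrect answer options, comma-separated:"
    )
    return {"question": question, "answer": answer,
            "distractors": [d.strip() for d in distractors.split(",")]}

if __name__ == "__main__":
    print(generate_mcq("Photosynthesis converts light energy into chemical "
                       "energy stored in glucose."))
```

Splitting generation into stages lets each call condition on the previous stage's output, which is the core of the multi-stage approach the paper investigates; a production pipeline would additionally need validation of the distractors.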
The study finds that while LLMs show promise in generating questions and MCQs, they still fall short of human performance in many areas. For example, T5 with a long prompt outperforms the other evaluated LLMs on automated evaluation metrics but still lags behind human experts. Similarly, text-davinci-003 scores well on human evaluation metrics yet does not match human experts in generating high-quality questions. The study also highlights the limitations of current LLMs in explaining Bengali grammatical errors and in assessing HR interview transcripts, underscoring the need for human intervention in these areas.
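As a point of reference for the "long prompt" setting, prompting a T5-family model for open-ended question generation might look like the sketch below. The prompt template and the flan-t5-base checkpoint are assumptions made for illustration; the study's exact prompts and model variants are not reproduced here.

```python
# Hedged sketch of open-ended question generation with a "long" prompt
# and a T5-family model. Template and checkpoint are illustrative
# assumptions, not the study's exact setup.
from transformers import pipeline

qg = pipeline("text2text-generation", model="google/flan-t5-base")

LONG_PROMPT = (
    "You are a teacher preparing an exam. Read the passage below and write "
    "one open-ended question that tests conceptual understanding rather "
    "than simple recall.\n"
    "Passage: {passage}\n"
    "Question:"
)

passage = ("Newton's third law states that for every action there is an "
           "equal and opposite reaction.")
question = qg(LONG_PROMPT.format(passage=passage),
              max_new_tokens=48)[0]["generated_text"]
print(question)
```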
Overall, the study underscores the potential of LLMs in educational and assessment applications while also highlighting the need for further research and refinement to fully harness their capabilities. It concludes that although LLMs can assist with various educational tasks, they are not yet equipped for fully automatic deployment in all areas, and a human-in-the-loop approach is necessary to ensure accuracy and quality.