ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic


30 Jul 2024 | Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, Timothy Baldwin
ArabicMMLU is a new benchmark for assessing massive multitask language understanding in Arabic. It consists of 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), sourced from school exams at various educational levels across North Africa, the Levant, and the Gulf. The dataset was carefully constructed in collaboration with native Arabic speakers from Jordan, Egypt, the UAE, Lebanon, and Saudi Arabia, ensuring rich local context, particularly in subjects such as history, geography, law, civics education, and driving tests.

The benchmark covers a wide range of subjects and educational levels: primary school, middle school, high school, and university questions account for 22.2%, 12.2%, 34.0%, and 6.1% of the dataset respectively, with the remainder categorized as "NA". Evaluation across 35 models reveals significant room for improvement, especially among the best open-source models: BLOOMZ, mT0, LLaMA2, and Falcon struggle to reach 50% accuracy, while even the top-performing Arabic-centric model scores only 62.3%.
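The paper's exact evaluation harness is not reproduced here, but MMLU-style benchmarks are commonly scored by comparing a causal LM's next-token likelihood over the answer letters. Below is a minimal sketch under that assumption; the model name, prompt template, and letter set are illustrative placeholders, not the paper's configuration.

```python
# Likelihood-based multiple-choice scoring, as commonly used for MMLU-style
# benchmarks. All names here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def score_question(question: str, options: list[str]) -> int:
    """Return the index of the option whose letter gets the highest logit."""
    letters = ["A", "B", "C", "D", "E"][: len(options)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    # Compare the logits of the answer-letter tokens (" A", " B", ...).
    letter_ids = [
        tokenizer.encode(f" {l}", add_special_tokens=False)[-1] for l in letters
    ]
    return int(logits[letter_ids].argmax())

# Example (question shown in English for readability; the benchmark is in MSA):
pred = score_question("What is the capital of Jordan?",
                      ["Amman", "Cairo", "Beirut", "Dubai"])
print("predicted option:", "ABCD"[pred])
```

Benchmark accuracy is then simply the fraction of questions for which the predicted index matches the gold answer.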
These results highlight the importance of regionally and culturally localized datasets for Arabic, as well as the difficulty of assessing Arabic-specific knowledge. Analyzing performance across education levels and countries shows that ArabicMMLU questions are more challenging at the high school level than at the primary and middle school levels. The study also investigates the impact of prompt language and few-shot learning, finding that models perform better with English prompts and with few-shot examples. Model confidence is found to be well calibrated, with correlation scores above 0.9. A further analysis of negation shows that negated questions generally yield slightly lower accuracy, particularly in Biology and Economics, whereas in Geography the models actually achieve higher accuracy.

The study concludes that ArabicMMLU is a valuable resource for evaluating the real-world knowledge and reasoning capabilities of future Arabic LLMs, and suggests future work covering short-answer or essay questions, additional modalities, and broader regional coverage.
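To make the calibration finding above concrete, one simple check is to bucket predictions by model confidence and correlate per-bucket confidence with per-bucket accuracy. The sketch below assumes this binned recipe; the paper reports correlation scores above 0.9 without this exact procedure being implied here.

```python
# Binned calibration check: correlate mean confidence with accuracy per bin.
# The 10-bin scheme is an assumption for illustration.
import numpy as np
from scipy.stats import pearsonr

def calibration_correlation(confidences, correct, n_bins=10):
    """Pearson r between mean confidence and accuracy across confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_conf, accuracy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():  # skip empty bins
            mean_conf.append(confidences[mask].mean())
            accuracy.append(correct[mask].mean())
    r, _ = pearsonr(mean_conf, accuracy)
    return r

# Synthetic example: answers are correct with probability equal to confidence,
# so the "model" is well calibrated by construction and r should be near 1.
rng = np.random.default_rng(0)
conf = rng.uniform(0.25, 1.0, size=5000)
correct = rng.uniform(size=5000) < conf
print(f"calibration correlation: {calibration_correlation(conf, correct):.3f}")
```

A well-calibrated model, like those observed in the study, yields a correlation close to 1.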