The paper introduces ArabicMMLU, the first multi-task language understanding benchmark for Arabic, sourced from school exams across diverse educational levels in North Africa, the Levant, and the Gulf regions. The dataset comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), constructed in collaboration with native speakers. Experiments with 35 models reveal substantial room for improvement, particularly among open-source models: GPT-4 achieves the best performance, while open-source models struggle to exceed 60% accuracy. The study highlights the need for more regionally and culturally localized datasets to evaluate Arabic language models effectively.