The paper introduces ArabicMMLU, the first multi-task language understanding benchmark for Arabic, sourced from school exams across diverse educational levels in North Africa, the Levant, and the Gulf regions. The dataset comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), constructed in collaboration with native speakers. Experiments with 35 models reveal substantial room for improvement, particularly among open-source models: GPT-4 achieves the best performance, while open-source models struggle to exceed 60% accuracy. The study highlights the need for more regionally and culturally localized datasets to evaluate Arabic language models effectively.