ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

July 14–18, 2024, Washington, USA | Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt
The paper introduces ArabicaQA, a comprehensive dataset for Arabic question answering that contains 89,095 answerable and 3,701 unanswerable questions, along with additional open-domain questions. The dataset is designed to address the significant gap in Arabic natural language processing (NLP) resources. The paper also presents AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus and tailored specifically to Arabic text retrieval. In addition, the study benchmarks large language models (LLMs) extensively on Arabic question answering, providing insight into their performance in the Arabic context. Together, ArabicaQA, AraDPR, and the LLM benchmarks advance Arabic NLP, particularly in machine reading comprehension and open-domain question answering. The dataset and code are publicly available for further research.