ArabicaQA is a dataset for Arabic question answering, consisting of 89,095 answerable and 3,701 unanswerable questions, along with 76,266 open-domain question-answer pairs. The authors describe it as the largest dataset of its kind for Arabic. It is accompanied by AraDPR, a dense passage retrieval model trained on the Arabic Wikipedia corpus and intended for Arabic text retrieval. Both the dataset and the model are publicly available for further research.
The work addresses the lack of resources for Arabic NLP by providing a resource for developing and testing models in machine reading comprehension and open-domain QA. The dataset includes extensive annotations and is designed to support both research and practical applications in Arabic language processing. The study evaluates the dataset and AraDPR in a range of experiments and also benchmarks large language models (LLMs) on Arabic question answering.
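
The summary does not describe how AraDPR works internally. As a rough illustration of dense passage retrieval in general, the sketch below ranks passages by the inner product between a question embedding and each passage embedding. The vectors are hand-made toy values standing in for the outputs of learned encoders, and the names `dot`, `rank_passages`, `passage_vecs`, and `question_vec` are hypothetical, not part of ArabicaQA or AraDPR.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    question_vec: List[float],
    passage_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (passage_id, score) pairs, highest score first."""
    scored = [(pid, dot(question_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values); a real system would obtain these
# from trained question and passage encoders.
passage_vecs = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.0]

top = rank_passages(question_vec, passage_vecs)
for pid, score in top:
    print(pid, round(score, 2))
```

In an open-domain QA pipeline, the highest-scoring passages would then be passed to a reader model that extracts or generates the answer.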

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

2024 | Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt