24 May 2024 | Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed
The paper introduces *Peacock*, a family of Arabic multimodal large language models (MLLMs) designed to close the gap in multimodal understanding for Arabic, including the Egyptian dialect. The models are trained on a combination of high-quality pretraining data translated from English datasets into Arabic and instruction fine-tuning datasets. *Peacock* comprises two architectures, one based on InstructBLIP and another on LLaVA-1.5, each integrated with a strong Arabic language model (LM) such as AceGPT or AraLLaMA. The models are evaluated on several benchmarks, including SEED-Bench, LLaVA-Bench, and a new benchmark called Henna that focuses on Arabic cultural elements. The results show that *Peacock* models outperform the multilingual mBLIP model on several tasks, with strong performance in visual reasoning and in responding in dialectal Arabic. The paper also highlights the importance of careful data processing and of selecting an appropriate LM for multimodal task performance. Limitations include object hallucination, translation errors, and the scarcity of Arabic image-text pairs in the training data. The authors plan to release their models and a demo for future research and applications.
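
To make the LLaVA-1.5-style wiring mentioned above concrete, here is a minimal conceptual sketch (not the authors' released code): a frozen vision encoder yields patch features, a small MLP projector maps them into the language model's embedding space, and the projected "visual tokens" are prepended to the text embeddings before the Arabic LM decodes a response. All dimensions and names below are illustrative assumptions.

```python
# Hypothetical sketch of a LLaVA-style connector between a vision encoder and an Arabic LM.
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Two-layer MLP connector, in the spirit of LLaVA-1.5-style designs (illustrative)."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.net(patch_features)


def build_multimodal_inputs(
    patch_features: torch.Tensor,   # output of a frozen vision encoder (e.g. a ViT)
    text_embeddings: torch.Tensor,  # token embeddings from the LM's embedding table
    projector: VisualProjector,
) -> torch.Tensor:
    """Prepend projected visual tokens to the text embeddings fed to the LM."""
    visual_tokens = projector(patch_features)
    return torch.cat([visual_tokens, text_embeddings], dim=1)


if __name__ == "__main__":
    # Assumed dimensions: 576 ViT patches of width 1024, an LM hidden size of 4096.
    projector = VisualProjector(vision_dim=1024, lm_dim=4096)
    patches = torch.randn(1, 576, 1024)
    text = torch.randn(1, 32, 4096)
    fused = build_multimodal_inputs(patches, text, projector)
    print(fused.shape)  # torch.Size([1, 608, 4096])
```

The InstructBLIP-based variant differs mainly in the connector: instead of a simple MLP, a Q-Former-style module cross-attends to the image features before handing a fixed number of query tokens to the LM; the sketch above only illustrates the simpler projection path.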