24 May 2024 | Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed
Peacock is a family of Arabic multimodal large language models (MLLMs) with strong vision and language capabilities, designed to address the scarcity of high-quality multimodal resources in Arabic. The models are trained on a comprehensive dataset of text-image pairs, with a focus on visual reasoning and dialectal Arabic, and are accompanied by Henna, a new benchmark designed specifically to evaluate MLLMs on aspects of Arabic culture. Training proceeds in two stages: pretraining, which aligns visual and textual features in a common space, and instruction finetuning, which strengthens the models' ability to perform complex reasoning and engage in visual conversations. Evaluated on benchmarks including SEED-Bench, LLaVA-Bench, and Henna, the Peacock models demonstrate strong performance in visual reasoning, image captioning, and cultural understanding, and show promising capabilities in generating responses in dialectal Arabic, particularly when finetuned on Egyptian dialect data. The study highlights the importance of high-quality data and of selecting an appropriate underlying language model for effective multimodal task performance, contributing culturally aware Arabic MLLMs and setting a new benchmark for future work in this area.
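The two-stage recipe described above (visual-textual feature alignment followed by instruction finetuning) is a common pattern in MLLMs. The sketch below illustrates the general idea in PyTorch; the module names, dimensions, and freezing choices are illustrative assumptions, not the Peacock authors' actual implementation.

```python
# Minimal sketch of a two-stage MLLM training setup (illustrative assumptions,
# not the Peacock code). Stage 1 trains only a projection layer that maps
# frozen vision features into the language model's embedding space; stage 2
# additionally unfreezes the language model for instruction finetuning.
import torch
import torch.nn as nn


class ToyMLLM(nn.Module):
    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for a ViT/Q-Former
        self.projector = nn.Linear(vision_dim, llm_dim)          # vision-to-text alignment layer
        self.llm = nn.TransformerEncoder(                        # stand-in for an Arabic LLM
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_feats, text_embeds):
        vis = self.projector(self.vision_encoder(image_feats))   # project visual tokens
        seq = torch.cat([vis, text_embeds], dim=1)                # prepend them to the text prompt
        return self.lm_head(self.llm(seq))


def set_stage(model, stage):
    """Stage 1: train only the projector; stage 2: also finetune the LLM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.llm.parameters():
            p.requires_grad = True
        for p in model.lm_head.parameters():
            p.requires_grad = True


model = ToyMLLM()
set_stage(model, stage=1)  # pretraining on image-caption pairs for alignment
set_stage(model, stage=2)  # instruction finetuning on visual dialogue / reasoning data
```

In this kind of setup, stage 1 typically uses large-scale captioning data so the projector learns to map images into the language model's token space, while stage 2 uses smaller, higher-quality instruction data to elicit reasoning and conversational behavior.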