PALO: A Polyglot Large Multimodal Model for 5B People

5 Mar 2024 | Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
PALO is a polyglot large multimodal model that provides visual reasoning capabilities in ten major languages, covering approximately 5 billion people (65% of the global population). It addresses the gap in multilingual large multimodal models (LMMs) with a fully open-source solution that supports English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese.

The model is trained on a multilingual instruction-tuning dataset created through a semi-automated translation process that uses a fine-tuned large language model (LLM), ensuring high linguistic fidelity while keeping the pipeline scalable. PALO is trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability.

Architecturally, PALO integrates a vision encoder with a multilingual language model, allowing it to generate responses to visual inputs in any of the ten languages. It is fine-tuned on a diverse instruction dataset containing conversations in all ten languages, improving performance on both high-resource and low-resource languages.
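The article describes the architecture only at a high level (a vision encoder coupled to a language model). The following is a minimal sketch of that general design under stated assumptions; the module names, dimensions, and the exact way visual tokens are joined to the text are illustrative placeholders, not PALO's actual implementation:

```python
# Sketch of a vision-encoder + projector + multilingual-LLM pipeline, as
# summarized above. All names and shapes here are assumptions for illustration.
import torch
import torch.nn as nn

class PolyglotLMMSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                 language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a CLIP-style ViT producing patch features
        self.projector = projector             # maps vision features to the LLM embedding size
        self.language_model = language_model   # multilingual causal LM that accepts input embeddings

    def forward(self, pixel_values, text_embeds, text_mask):
        vision_feats = self.vision_encoder(pixel_values)    # (B, N_patches, D_vis)
        vision_tokens = self.projector(vision_feats)         # (B, N_patches, D_llm)
        # Prepend the projected visual tokens to the text embeddings,
        # then let the language model generate the (multilingual) response.
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        vis_mask = torch.ones(vision_tokens.shape[:2],
                              dtype=text_mask.dtype, device=text_mask.device)
        mask = torch.cat([vis_mask, text_mask], dim=1)
        return self.language_model(inputs_embeds=inputs, attention_mask=mask)
```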
Evaluated against strong baselines, PALO shows substantial improvements across multiple languages, especially for underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The authors also propose a new multilingual multimodal benchmark for assessing vision-language reasoning across languages; it comprises 24 diverse and challenging images from different domains, each with a detailed description, together with a set of 60 questions. On this benchmark, PALO achieves robust performance in high-resource languages and significant gains in low-resource ones.

A mobile version of the model likewise shows consistent improvements across both high-resource and low-resource languages. Its efficiency comes from a lightweight downsample projector that reduces the number of visual tokens passed to the language model, significantly cutting training and inference time.
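To make the token-reduction idea concrete, here is a minimal sketch of a downsample projector that pools the visual patch tokens before projecting them into the LLM embedding space. The pooling strategy, stride, and layer sizes are illustrative assumptions, not PALO's published configuration:

```python
# Sketch: fewer visual tokens reach the language model, so the sequence the
# LLM processes is shorter and training/inference is cheaper. Assumed shapes.
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=2048, stride=2):
        super().__init__()
        # Average-pool along the token dimension to cut the number of visual tokens.
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):            # (B, N, D_vis)
        x = vision_feats.transpose(1, 2)        # (B, D_vis, N) for 1D pooling
        x = self.pool(x).transpose(1, 2)        # (B, N // stride, D_vis)
        return self.proj(x)                     # (B, N // stride, D_llm)

# Example: 576 patch tokens are reduced to 288 tokens for the language model.
feats = torch.randn(1, 576, 1024)
print(DownsampleProjector()(feats).shape)       # torch.Size([1, 288, 2048])
```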