5 Mar 2024 | Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
The paper introduces PALO, a large multilingual multimodal model designed to handle visual reasoning tasks in 10 major languages, covering approximately 5 billion people (65% of the global population). PALO addresses a gap in existing multimodal models by focusing on underrepresented languages such as Hindi, Arabic, Bengali, and Urdu. Its training data is produced with a semi-automated translation approach: a state-of-the-art Large Language Model (LLM) translates an English multimodal instruction dataset into each target language, and the translation LLM is fine-tuned to improve translation quality and ensure each language is accurately represented, yielding high linguistic fidelity. PALO is trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, showing significant improvements over strong baselines, especially in low-resource languages. The paper also proposes a multilingual multimodal benchmark for evaluating the vision-language reasoning capabilities of future models. The code for PALO is available on GitHub.
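To make the semi-automated translation step concrete, the sketch below shows one way such a pipeline could be structured. It is a minimal illustration under stated assumptions, not PALO's actual code: the LLaVA-style JSON layout (records with a `conversations` list of `value` fields), the `llm_translate` stub, and the prompt wording are all assumptions, and the LLM call is left as a placeholder to be wired to a fine-tuned translation model.

```python
# Hypothetical sketch of a semi-automated dataset translation pipeline.
# The data format, function names, and prompt are illustrative assumptions,
# not the paper's implementation.

import json

TARGET_LANGUAGES = ["Hindi", "Arabic", "Bengali", "Urdu"]  # subset of the 10


def llm_translate(text: str, language: str) -> str:
    """Placeholder for a call to a fine-tuned translation LLM.

    In the paper's approach, the translator LLM is fine-tuned beforehand
    so its outputs faithfully represent each target language.
    """
    prompt = (
        f"Translate the following visual-instruction text into {language}. "
        f"Preserve meaning, tone, and any references to the image.\n\n{text}"
    )
    # Assumption: replace with a real call to your LLM inference endpoint.
    raise NotImplementedError("Wire this to a fine-tuned translation LLM.")


def translate_dataset(in_path: str, out_path: str, language: str) -> None:
    """Translate every turn in an (assumed) LLaVA-style instruction JSON file."""
    with open(in_path) as f:
        records = json.load(f)
    for record in records:
        for turn in record["conversations"]:
            turn["value"] = llm_translate(turn["value"], language)
    with open(out_path, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    for lang in TARGET_LANGUAGES:
        translate_dataset(
            "llava_instruct_en.json",  # hypothetical input file name
            f"llava_instruct_{lang.lower()}.json",
            lang,
        )
```

In a semi-automated setup like this, a sample of each language's output would then be manually verified and the corrections used to fine-tune the translator, which is the refinement loop the paper credits for its linguistic fidelity.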