July 2024 | Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen and Xiaohua Zhai
PaliGemma is an open Vision-Language Model (VLM) that combines the SigLIP-So400m vision encoder with the Gemma-2B language model. It is designed as a versatile, general-purpose base model that transfers effectively to a wide range of tasks, matching the performance of much larger models such as PaLI-X and PaLM-E while remaining significantly smaller (under 3B parameters). PaliGemma is evaluated on nearly 40 diverse tasks, spanning standard VLM benchmarks as well as specialized tasks such as remote sensing and segmentation. The paper details the model's architecture, training process, and evaluation. Key contributions include a prefix-LM strategy for pretraining, the finding that the image encoder should not be frozen during multimodal pretraining, and the release of checkpoints at multiple resolutions, which improves performance on resolution-sensitive tasks. The results demonstrate that PaliGemma can be fine-tuned effectively for new tasks with minimal hyperparameter tuning and only a small number of examples.
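To make the prefix-LM pretraining strategy concrete: the image tokens and the text prompt form a prefix that attends to itself bidirectionally, while the output text (the suffix) is generated autoregressively under a causal mask. The sketch below is a minimal NumPy illustration of such an attention mask, not the authors' implementation; the helper name `prefix_lm_mask` is hypothetical.

```python
import numpy as np

def prefix_lm_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    """Boolean [T, T] attention mask for a prefix-LM, T = num_prefix + num_suffix.

    Prefix positions (image patches + prompt tokens) attend to the whole
    prefix bidirectionally; suffix positions (the generated answer) attend
    to the full prefix and causally to earlier suffix positions.
    True means "query row may attend to key column".
    """
    total = num_prefix + num_suffix
    mask = np.zeros((total, total), dtype=bool)
    # Every position may attend to the entire prefix.
    mask[:, :num_prefix] = True
    # Suffix positions attend causally among themselves.
    mask[num_prefix:, num_prefix:] = np.tril(
        np.ones((num_suffix, num_suffix), dtype=bool)
    )
    return mask

# Example: a 4-token prefix (image + prompt) and a 3-token suffix (answer).
print(prefix_lm_mask(4, 3).astype(int))
```

In the printed mask, every row may attend to the full prefix, and suffix rows additionally attend only to earlier suffix positions, so the image and prompt see each other fully while generation remains causal.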