July 2024 | Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen and Xiaohua Zhai
PaliGemma is a versatile open Vision-Language Model (VLM) built on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained as a general-purpose base model that transfers effectively to a wide range of tasks: it achieves strong performance on standard VLM benchmarks as well as on more specialized tasks such as remote sensing and segmentation, and it is designed to be fine-tuned for specific applications.
PaliGemma builds on the PaLI series of vision-language models, which have shown strong scaling results. It combines the roughly 400M-parameter SigLIP-So400m encoder and the 2B-parameter Gemma model into a sub-3B VLM that maintains performance comparable to the larger PaLI-X, PaLM-E, and PaLI-3. Gemma is a family of open large language models; PaliGemma uses the 2B pretrained version.
The model's architecture consists of an image encoder, a decoder-only language model, and a linear layer that projects SigLIP's output tokens into the same dimension as Gemma-2B's vocabulary embeddings. The image encoder converts the input image into a sequence of tokens, and the text is tokenized with Gemma's tokenizer; the projected image tokens are concatenated with the embedded text tokens and the combined sequence is fed to the decoder.
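To make the data flow concrete, here is a minimal NumPy sketch of the projection-and-concatenation step. The dimensions are assumptions for illustration (256 image tokens of width 1152 from SigLIP-So400m at 224 px, a 2048-wide Gemma-2B embedding space), and the random arrays stand in for the real encoder outputs and learned weights:

```python
import numpy as np

# Assumed dimensions, for illustration only: SigLIP-So400m at 224 px yields
# 256 patch tokens of width 1152; Gemma-2B embeds tokens in a 2048-wide space.
NUM_IMAGE_TOKENS, SIGLIP_WIDTH, GEMMA_WIDTH = 256, 1152, 2048

rng = np.random.default_rng(0)

# Stand-ins for the real components.
image_tokens = rng.normal(size=(NUM_IMAGE_TOKENS, SIGLIP_WIDTH))  # SigLIP output
projection = rng.normal(size=(SIGLIP_WIDTH, GEMMA_WIDTH)) * 0.02  # linear adapter
text_embeddings = rng.normal(size=(12, GEMMA_WIDTH))              # embedded prompt tokens

# Project image tokens into Gemma's embedding space and prepend them to the text.
projected_image = image_tokens @ projection
decoder_input = np.concatenate([projected_image, text_embeddings], axis=0)

print(decoder_input.shape)  # (268, 2048): 256 image tokens + 12 text tokens
```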
PaliGemma is trained in three stages: unimodal pretraining, multimodal pretraining, and resolution increase. During unimodal pretraining, the individual components are pretrained separately: the SigLIP vision encoder on image-text data and the Gemma language model on text, reusing publicly available checkpoints. Multimodal pretraining then trains the combined model on a broad mixture of vision-language tasks. Finally, the resolution-increase stage continues training at higher resolutions (448 and 896 pixels) so the model can handle tasks that benefit from finer image detail.
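The staged recipe can be summarized as a small configuration sketch. This is not the actual training configuration; the stage names and mixture descriptions below are illustrative placeholders, with only the resolutions taken from the recipe described above:

```python
# Illustrative sketch of the three-stage recipe; not the real training config.
STAGES = [
    {
        "name": "stage0_unimodal",
        "note": "SigLIP encoder and Gemma LM pretrained separately (off-the-shelf checkpoints)",
    },
    {
        "name": "stage1_multimodal",
        "resolution": 224,
        "mixture": "broad vision-language mix (captioning, OCR, VQA, detection, segmentation, ...)",
    },
    {
        "name": "stage2_resolution_increase",
        "resolutions": [448, 896],
        "mixture": "same mix, continued at higher resolution",
    },
]

for stage in STAGES:
    res = stage.get("resolution") or stage.get("resolutions") or "-"
    print(f"{stage['name']}: resolution={res}")
```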
PaliGemma is evaluated on over 40 diverse tasks, including standard VLM benchmarks and specialized tasks. It shows strong performance on tasks such as image classification, captioning, visual question-answering, and dialogue. It also performs well on more complex tasks such as detection, instance segmentation, and panoptic segmentation.
The model is pretrained on a mixture of tasks and datasets, including captioning, OCR, question answering, detection, instance segmentation, and grounded captioning. The pretraining mixture is designed to produce a model that transfers well to a wide range of tasks, not necessarily one that is usable out of the box.
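As an illustration of how such tasks can be posed as plain text-to-text examples, the sketch below builds a detection-style training pair: a short task prefix ("detect cat") and a target made of quantized location tokens followed by the class name. The helper function, the 1024-bin quantization, and the (ymin, xmin, ymax, xmax) ordering are assumptions for illustration and should be checked against the released model's documentation:

```python
def to_location_tokens(box_yxyx, image_size=224, num_bins=1024):
    """Quantize a (ymin, xmin, ymax, xmax) pixel box into <locXXXX> tokens.

    Sketch of a detection target format: coordinates are normalized and
    binned into 1024 location tokens (assumed convention, for illustration).
    """
    tokens = []
    for coord in box_yxyx:
        bin_index = min(int(coord / image_size * num_bins), num_bins - 1)
        tokens.append(f"<loc{bin_index:04d}>")
    return "".join(tokens)

# Example: a "detect cat" training pair, with the target expressed as
# four location tokens followed by the class name.
prefix = "detect cat"
suffix = to_location_tokens((30, 48, 180, 200)) + " cat"
print(prefix, "->", suffix)
```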
PaliGemma is a versatile base VLM that can be fine-tuned for a wide variety of applications. It is intended as a useful starting point for further research on instruction tuning and on specific applications, and to encourage a clearer separation between base models and fine-tunes in VLM research.
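As a concrete starting point, the released base checkpoints can be loaded through the Hugging Face transformers integration. The snippet below is a minimal inference sketch, assuming a recent transformers version, access to the gated google/paligemma-3b-pt-224 checkpoint, and a local example image, and using the "caption en" pretraining prefix as the prompt:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # base (pretrained) checkpoint at 224 px
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder: any local image
prompt = "caption en"              # one of the pretraining task prefixes

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=30)

# Strip the prompt tokens and decode only the newly generated caption.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```

For task-specific use, the same checkpoint would typically be fine-tuned on the target dataset rather than prompted directly, in line with the base-model-plus-fine-tune workflow described above.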