July 2024 | Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen and Xiaohua Zhai
PaliGemma is a versatile open Vision-Language Model (VLM) built on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained as a general-purpose base model that transfers effectively to a wide range of tasks: it achieves strong performance on standard VLM benchmarks as well as on more specialized tasks such as remote sensing and segmentation, and it is designed to be fine-tuned for specific applications.
PaliGemma builds on the PaLI series of vision-language models, which have shown strong scaling results. It combines the roughly 400M-parameter SigLIP-So400m encoder and the 2B-parameter Gemma model into a sub-3B VLM that maintains performance comparable to the larger PaLI-X, PaLM-E, and PaLI-3. Gemma is a family of open large language models; PaliGemma uses the 2B pretrained version.
The model's architecture consists of an image encoder, a decoder-only language model, and a linear layer that projects SigLIP's output tokens into the same dimension as Gemma-2B's vocabulary embeddings. The image encoder converts the input image into a sequence of tokens, and the text is tokenized with Gemma's tokenizer; the projected image tokens are concatenated with the embedded text tokens and the combined sequence is fed to the decoder.
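To make the data flow concrete, here is a minimal NumPy sketch of the projection-and-concatenation step. The dimensions are assumptions for illustration (256 image tokens of width 1152 from SigLIP-So400m at 224 px, a 2048-wide Gemma-2B embedding space), and the random arrays stand in for the real encoder outputs and learned weights:

```python
import numpy as np

# Assumed dimensions, for illustration only: SigLIP-So400m at 224 px yields
# 256 patch tokens of width 1152; Gemma-2B embeds tokens in a 2048-wide space.
NUM_IMAGE_TOKENS, SIGLIP_WIDTH, GEMMA_WIDTH = 256, 1152, 2048

rng = np.random.default_rng(0)

# Stand-ins for the real components.
image_tokens = rng.normal(size=(NUM_IMAGE_TOKENS, SIGLIP_WIDTH))  # SigLIP output
projection = rng.normal(size=(SIGLIP_WIDTH, GEMMA_WIDTH)) * 0.02  # linear adapter
text_embeddings = rng.normal(size=(12, GEMMA_WIDTH))              # embedded prompt tokens

# Project image tokens into Gemma's embedding space and prepend them to the text.
projected_image = image_tokens @ projection
decoder_input = np.concatenate([projected_image, text_embeddings], axis=0)

print(decoder_input.shape)  # (268, 2048): 256 image tokens + 12 text tokens
```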
PaliGemma is trained in three stages: unimodal pretraining, multimodal pretraining, and resolution increase. During unimodal pretraining, the individual components are pretrained separately: the SigLIP vision encoder on image-text data and the Gemma language model on text, reusing publicly available checkpoints. Multimodal pretraining then trains the combined model on a broad mixture of vision-language tasks. Finally, the resolution-increase stage continues training at higher resolutions (448 and 896 pixels) so the model can handle tasks that benefit from finer image detail.
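The staged recipe can be summarized as a small configuration sketch. This is not the actual training configuration; the stage names and mixture descriptions below are illustrative placeholders, with only the resolutions taken from the recipe described above:

```python
# Illustrative sketch of the three-stage recipe; not the real training config.
STAGES = [
    {
        "name": "stage0_unimodal",
        "note": "SigLIP encoder and Gemma LM pretrained separately (off-the-shelf checkpoints)",
    },
    {
        "name": "stage1_multimodal",
        "resolution": 224,
        "mixture": "broad vision-language mix (captioning, OCR, VQA, detection, segmentation, ...)",
    },
    {
        "name": "stage2_resolution_increase",
        "resolutions": [448, 896],
        "mixture": "same mix, continued at higher resolution",
    },
]

for stage in STAGES:
    res = stage.get("resolution") or stage.get("resolutions") or "-"
    print(f"{stage['name']}: resolution={res}")
```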
PaliGemma is evaluated on over 40 diverse tasks, including standard VLM benchmarks and specialized tasks. It shows strong performance on tasks such as image classification, captioning, visual question-answering, and dialogue. It also performs well on more complex tasks such as detection, instance segmentation, and panoptic segmentation.
The model is pretrained on a mixture of tasks and datasets, including captioning, OCR, question answering, detection, instance segmentation, and grounded captioning. The pretraining mixture is designed to produce a model that transfers well to a wide range of tasks, not necessarily one that is usable out of the box.
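As an illustration of how such tasks can be posed as plain text-to-text examples, the sketch below builds a detection-style training pair: a short task prefix ("detect cat") and a target made of quantized location tokens followed by the class name. The helper function, the 1024-bin quantization, and the (ymin, xmin, ymax, xmax) ordering are assumptions for illustration and should be checked against the released model's documentation:

```python
def to_location_tokens(box_yxyx, image_size=224, num_bins=1024):
    """Quantize a (ymin, xmin, ymax, xmax) pixel box into <locXXXX> tokens.

    Sketch of a detection target format: coordinates are normalized and
    binned into 1024 location tokens (assumed convention, for illustration).
    """
    tokens = []
    for coord in box_yxyx:
        bin_index = min(int(coord / image_size * num_bins), num_bins - 1)
        tokens.append(f"<loc{bin_index:04d}>")
    return "".join(tokens)

# Example: a "detect cat" training pair, with the target expressed as
# four location tokens followed by the class name.
prefix = "detect cat"
suffix = to_location_tokens((30, 48, 180, 200)) + " cat"
print(prefix, "->", suffix)
```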
PaliGemma is a versatile base VLM that can be fine-tuned for a wide variety of applications. It is intended as a useful starting point for further research on instruction tuning and on specific applications, and to encourage a clearer separation between base models and fine-tunes in VLM research.
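As a concrete starting point, the released base checkpoints can be loaded through the Hugging Face transformers integration. The snippet below is a minimal inference sketch, assuming a recent transformers version, access to the gated google/paligemma-3b-pt-224 checkpoint, and a local example image, and using the "caption en" pretraining prefix as the prompt:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # base (pretrained) checkpoint at 224 px
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder: any local image
prompt = "caption en"              # one of the pretraining task prefixes

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=30)

# Strip the prompt tokens and decode only the newly generated caption.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```

For task-specific use, the same checkpoint would typically be fine-tuned on the target dataset rather than prompted directly, in line with the base-model-plus-fine-tune workflow described above.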