Improving fine-grained understanding in image-text pre-training

2024-1-19 | Ioana Bica, Anastasija Ilic, Matthias Bauer, Goker Erdogan, Matko Bosnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrovic
The paper introduces SPARse Fine-grained Contrastive Alignment (SPARC), a method for pretraining multimodal models on large-scale noisy image-text data. SPARC aims to learn both coarse-grained and fine-grained information by grouping image patches corresponding to individual words in the caption. The method involves computing a sparse similarity metric between image patches and language tokens, and then using these similarities to compute language-grouped vision embeddings for each token. These embeddings are contrasted with token embeddings through a fine-grained sequence-wise loss, which only depends on individual samples and does not require other batch samples as negatives. This approach enables the model to learn more detailed information in a computationally efficient manner. SPARC combines this fine-grained loss with a global contrastive loss between image and text embeddings to encode both global and local information. Extensive experiments show that SPARC significantly improves performance on both coarse-grained and fine-grained downstream tasks, including classification, retrieval, object detection, and segmentation. Additionally, SPARC enhances model faithfulness and captioning in foundational vision-language models.
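The per-sample mechanism described above can be sketched in code. The snippet below is a minimal, hedged illustration (not the authors' implementation): it assumes per-token patch similarities are min-max normalized, sparsified with a threshold of 1/(number of patches), and renormalized into weights that pool patches into a language-grouped vision embedding per token; the fine-grained loss is then a symmetric contrastive loss between these grouped embeddings and the token embeddings, using only the other tokens of the same sample as negatives. All function names, the temperature value, and the exact sparsification details are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def sparc_fine_grained_loss(patch_emb, token_emb, tau=0.07):
    """Sketch of a SPARC-style fine-grained sequence-wise loss.

    patch_emb: (P, d) image patch embeddings for one image
    token_emb: (T, d) language token embeddings for its caption
    Returns a scalar loss computed from this sample alone
    (no cross-sample negatives are needed).
    """
    P = patch_emb.shape[0]

    # 1. Token-to-patch similarities.
    sim = token_emb @ patch_emb.T                      # (T, P)

    # 2. Sparsify: min-max normalize per token, zero entries below a
    #    1/P threshold, then renormalize rows to sum to one.
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim_norm = (sim - lo) / (hi - lo + 1e-8)
    weights = np.where(sim_norm < 1.0 / P, 0.0, sim_norm)
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-8)

    # 3. Language-grouped vision embedding: weighted pooling of the
    #    patches most relevant to each token.
    grouped = weights @ patch_emb                      # (T, d)

    # 4. Sequence-wise contrastive loss: each token should match its
    #    own grouped vision embedding; the sample's other tokens act
    #    as negatives.
    v = l2_normalize(grouped)
    t = l2_normalize(token_emb)
    logits = (v @ t.T) / tau                           # (T, T)
    labels = np.arange(logits.shape[0])

    def cross_entropy(lg):
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric over the vision->text and text->vision directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In a full training objective this per-sample term would be added, with some weighting, to the usual global CLIP-style contrastive loss between pooled image and text embeddings, so that the model encodes both global and local information.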