SPARC (SPARse fine-grained Contrastive alignment) is a method for pretraining multimodal models that learns both coarse-grained and fine-grained information from image-text pairs. It introduces a sparse similarity metric between image patches and language tokens, which is used to build language-grouped vision embeddings by aggregating the image patches corresponding to each word in the caption. These embeddings are then contrasted with the token embeddings through a fine-grained sequence-wise loss that operates on individual samples and does not require other batch samples as negatives. SPARC combines this fine-grained loss with a global contrastive loss between image and text embeddings, so the model encodes both global and local information. Evaluated on image classification, retrieval, object detection, and segmentation, SPARC improves performance over competing approaches, and it also improves model faithfulness and captioning in foundational vision-language models. The design sidesteps several limitations of existing fine-grained methods: high computational and memory cost, reliance on softmax for attention weights, and dependence on pre-trained models. As a result, SPARC learns fine-grained information more flexibly and efficiently, making it suitable for a wide range of vision-language tasks.
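
To make the two objectives concrete, below is a minimal, self-contained PyTorch sketch of the SPARC training losses as described above: a CLIP-style global contrastive loss, plus a fine-grained loss that sparsifies patch-token similarities, aggregates patches into language-grouped vision embeddings, and contrasts them with token embeddings within each sequence. The function name `sparc_losses`, the mean-pooled global embeddings, the `1/P` sparsity threshold, and the simplified padding handling are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sparc_losses(patch_emb, token_emb, token_mask, temperature=0.07):
    """Sketch of the SPARC objectives for one batch.

    patch_emb:  (B, P, D) image patch embeddings
    token_emb:  (B, T, D) language token embeddings
    token_mask: (B, T) 1.0 for real tokens, 0.0 for padding
    """
    B, P, D = patch_emb.shape
    _, T, _ = token_emb.shape

    # ---- global contrastive loss (image-text, uses batch negatives) ----
    img_global = F.normalize(patch_emb.mean(dim=1), dim=-1)            # (B, D)
    txt_global = F.normalize(
        (token_emb * token_mask[..., None]).sum(1)
        / token_mask.sum(1, keepdim=True), dim=-1)                     # (B, D)
    logits = img_global @ txt_global.t() / temperature                 # (B, B)
    labels = torch.arange(B, device=logits.device)
    global_loss = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))

    # ---- fine-grained sequence-wise loss (per-sample, no batch negatives) ----
    # similarity between every token and every patch of the SAME pair
    sim = torch.einsum('btd,bpd->btp', token_emb, patch_emb)           # (B, T, P)

    # sparsify: min-max normalise per token, then zero out patch weights
    # below a 1/P threshold and renormalise the survivors to sum to 1
    sim_min = sim.min(dim=-1, keepdim=True).values
    sim_max = sim.max(dim=-1, keepdim=True).values
    align = (sim - sim_min) / (sim_max - sim_min + 1e-8)
    align = torch.where(align >= 1.0 / P, align, torch.zeros_like(align))
    align = align / (align.sum(dim=-1, keepdim=True) + 1e-8)

    # language-grouped vision embeddings: weighted sum of patches per token
    grouped = F.normalize(
        torch.einsum('btp,bpd->btd', align, patch_emb), dim=-1)       # (B, T, D)
    tokens = F.normalize(token_emb, dim=-1)

    # contrast each token against the grouped embeddings of its own sequence
    seq_logits = torch.einsum('btd,bsd->bts', tokens, grouped) / temperature
    tgt = torch.arange(T, device=seq_logits.device).expand(B, T)
    loss_t2v = F.cross_entropy(seq_logits.reshape(B * T, T),
                               tgt.reshape(B * T), reduction='none').view(B, T)
    loss_v2t = F.cross_entropy(seq_logits.transpose(1, 2).reshape(B * T, T),
                               tgt.reshape(B * T), reduction='none').view(B, T)
    # mask out padded token positions before averaging
    fine_loss = (0.5 * (loss_t2v + loss_v2t) * token_mask).sum() / token_mask.sum()

    return global_loss + fine_loss
```

Note how the fine-grained term is computed entirely within each image-caption pair: the `(B, T, T)` logits only compare a sequence's tokens with that sequence's grouped vision embeddings, which is why this loss avoids both the batch-wide patch-token similarity matrices and the softmax attention weights that make earlier fine-grained objectives expensive.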