26 Jun 2024 | Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
The paper introduces a novel multi-task contrastive training method to address the underperformance of CLIP models on text-only tasks relative to specialized text embedding models. The proposed method, applied to the jina-clip-v1 model, jointly optimizes text-image and text-text matching, enabling strong performance on both types of tasks. The model is trained on large-scale image-caption pairs and text pairs in a three-stage process that successively optimizes text-image alignment, long-caption processing, and text-text matching with hard negative sampling. The resulting jina-clip-v1 model achieves state-of-the-art performance on both text-image and text-text retrieval, outperforming OpenAI's CLIP and EVA-CLIP on cross-modal benchmarks, and competes closely with top-tier text-only embedding models on the Massive Text Embedding Benchmark (MTEB), demonstrating its effectiveness on text-only tasks. The paper highlights the potential savings for applications that can replace separate models for different task modalities with a single unified multimodal model.
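The core idea of the joint objective can be illustrated with a minimal sketch: a shared text encoder is trained with two in-batch contrastive losses, one against an image encoder (text-image matching) and one against itself over query-passage pairs (text-text matching). The function names, batch keys, temperature value, and equal weighting of the two terms below are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def info_nce(queries: torch.Tensor, targets: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """Symmetric in-batch InfoNCE: off-diagonal batch items serve as negatives."""
    queries = F.normalize(queries, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = queries @ targets.T / temperature
    labels = torch.arange(queries.size(0), device=queries.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


def multi_task_step(text_encoder, image_encoder, batch):
    """One training step jointly optimizing text-image and text-text matching."""
    # Text-image pairs: captions aligned with their corresponding images.
    cap_emb = text_encoder(batch["captions"])
    img_emb = image_encoder(batch["images"])
    loss_ti = info_nce(cap_emb, img_emb)

    # Text-text pairs: queries aligned with their positive passages.
    qry_emb = text_encoder(batch["queries"])
    pos_emb = text_encoder(batch["passages"])
    loss_tt = info_nce(qry_emb, pos_emb)

    # Joint objective; equal weighting is an assumption of this sketch.
    return loss_ti + loss_tt
```

In the later training stages described above, the text-text term would additionally draw on mined hard negatives rather than relying solely on in-batch negatives.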