26 Jun 2024 | Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
This paper introduces a multi-task contrastive training method that addresses the inefficiency of maintaining separate models for text-only and multimodal tasks. The resulting model, jina-clip-v1, is trained jointly on text-image and text-text pairs, enabling it to perform well on both kinds of tasks. It uses a dual-encoder architecture, with a text encoder based on JinaBERT and an image encoder based on EVA02. Training proceeds in multiple stages, each focusing on different aspects of text-text and text-image alignment. The model is evaluated on several benchmarks, including the CLIP Benchmark and the MTEB Benchmark, where it shows strong performance: it outperforms comparable models on text-image retrieval and performs on par with specialized text-embedding models on text-only tasks. The model is currently limited to English-language text due to limited multilingual resources; future work will extend the approach to multilingual settings. The paper also reviews related work in contrastive learning and provides detailed training settings and performance results.
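To make the joint training objective concrete, here is a minimal sketch (not the authors' code) of a multi-task contrastive loss for a dual-encoder setup: one text tower and one image tower produce embeddings, an InfoNCE loss is computed separately for text-text pairs and text-image pairs, and the two losses are combined. The encoder arguments, the 0.07 temperature, and the 0.5/0.5 loss weighting are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of a multi-task contrastive objective for a dual encoder.
# `text_encoder` and `image_encoder` are assumed to be modules that map a
# batch of inputs to embedding vectors of the same dimensionality.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multitask_contrastive_loss(text_encoder, image_encoder,
                               queries, passages, captions, images,
                               w_text: float = 0.5, w_image: float = 0.5) -> torch.Tensor:
    """Combine a text-text loss (query/passage pairs) with a text-image loss
    (caption/image pairs), mirroring joint training on both pair types."""
    loss_text = info_nce(text_encoder(queries), text_encoder(passages))
    loss_image = info_nce(text_encoder(captions), image_encoder(images))
    return w_text * loss_text + w_image * loss_image
```

Because the text tower appears in both loss terms, it is optimized for text-text retrieval and text-image alignment at the same time, which is what lets a single model serve both kinds of tasks.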