Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

26 Feb 2024 | Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, Nan Wang and Han Xiao
This paper introduces a suite of state-of-the-art bilingual text embedding models that support English and one target language and can process text inputs of up to 8192 tokens. The models are versatile across NLP tasks such as text retrieval, clustering, and semantic textual similarity (STS). By focusing on bilingual models and introducing a multi-task learning objective, the authors significantly improve performance on STS tasks, outperforming existing multilingual models in both target-language understanding and cross-lingual evaluation. Because the bilingual models need a much smaller vocabulary, they also require fewer parameters and less memory. The authors further expand the Massive Text Embedding Benchmark (MTEB) with benchmarks for German and Spanish embedding models, aiming to stimulate research and advancement in text embedding technologies for these languages. The evaluation shows that the bilingual models achieve superior or competitive performance compared to multilingual models while using fewer parameters. Ablation studies confirm that bilingual models outperform multilingual models when fine-tuned on an embedding task, and that multi-task learning effectively enhances performance on STS tasks.
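The summary does not spell out the training objective, so the following is only a minimal sketch of one common way to combine a contrastive (InfoNCE) pair loss with a CoSENT-style similarity-ranking loss for STS data in a multi-task setup. The function names, temperature, scale, and loss weighting below are illustrative assumptions, not the authors' exact recipe.

```python
# Illustrative multi-task objective: InfoNCE over in-batch negatives plus a
# CoSENT-style ranking loss on STS pairs. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def info_nce_loss(queries: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Bidirectional InfoNCE over in-batch negatives (inputs: [B, D])."""
    q = F.normalize(queries, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.T / temperature                      # [B, B] cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # matching pair i <-> i
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


def cosent_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                gold_scores: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """CoSENT-style loss: pairs with higher gold STS scores should get higher cosine similarity."""
    sims = scale * F.cosine_similarity(emb_a, emb_b, dim=-1)   # [N]
    diffs = sims[None, :] - sims[:, None]                      # diffs[i, j] = sim_j - sim_i
    mask = gold_scores[:, None] > gold_scores[None, :]         # gold says pair i ranks above pair j
    diffs = diffs[mask]
    zero = torch.zeros(1, device=sims.device)
    return torch.logsumexp(torch.cat([zero, diffs]), dim=0)    # log(1 + sum exp(sim_j - sim_i))


def multi_task_loss(q: torch.Tensor, p: torch.Tensor,
                    a: torch.Tensor, b: torch.Tensor,
                    gold_scores: torch.Tensor, sts_weight: float = 1.0) -> torch.Tensor:
    """Weighted sum of the contrastive pair loss and the STS ranking loss."""
    return info_nce_loss(q, p) + sts_weight * cosent_loss(a, b, gold_scores)
```

In such a setup each training step typically draws one batch of (query, positive) pairs for the contrastive term and one batch of scored sentence pairs for the STS term; the weighting between the two terms is a tunable design choice.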