Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

26 Feb 2024 | Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, Nan Wang and Han Xiao
This paper introduces a family of bilingual text embedding models that pair English with a target language such as German or Spanish. The models handle inputs of up to 8192 tokens, making them suitable for a range of natural language processing tasks, including text retrieval, clustering, and semantic textual similarity (STS). They are trained with a multi-task learning approach that significantly improves performance on STS and cross-lingual evaluation tasks compared with existing multilingual models, while the smaller vocabulary of a bilingual model reduces parameter count and memory footprint. The authors also extend the Massive Text Embedding Benchmark (MTEB) with German and Spanish evaluation tasks to encourage further research on text embeddings for these languages.

The paper examines how to train bilingual embedding models on top of a backbone language model that covers both languages. It discusses the difficulties of multilingual training, such as the "curse of multilinguality," and proposes instead to build bilingual models that target specific language pairs. Training proceeds in stages: the backbone is pre-trained on a large text corpus, then fine-tuned contrastively on text pairs, and finally fine-tuned with a multi-task objective that targets specific downstream tasks. The models are evaluated on a range of benchmarks, including GLUE and XTREME, and the bilingual models outperform multilingual baselines on cross-lingual tasks while remaining competitive elsewhere.

Ablation studies show that the bilingual models outperform comparable multilingual models when trained on the same amount of pairwise data, and that the multi-task objective is particularly effective for STS. The authors conclude that their bilingual models match or exceed multilingual models while remaining smaller, making them more efficient and effective for cross-lingual applications.
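To illustrate the pairwise contrastive fine-tuning stage described above, the sketch below implements a standard bidirectional InfoNCE loss over in-batch negatives in PyTorch. This is a minimal, hypothetical example rather than the authors' actual training code; the temperature value, embedding dimension, and batch size are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """Bidirectional InfoNCE over in-batch negatives.

    query_emb, doc_emb: (batch, dim) embeddings of paired texts.
    The i-th query and i-th document form a positive pair; every other
    in-batch combination serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    # Cosine similarities scaled by a temperature hyperparameter (0.05 is an assumed value).
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: match queries to documents and documents to queries.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

# Example with random tensors standing in for encoder outputs:
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(info_nce_loss(q, d).item())
```

In multi-task fine-tuning, a loss of this form for retrieval-style pair data would be combined with task-specific objectives (for example, a similarity-regression loss for STS data), with batches drawn from the different task datasets.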