November 16–20, 2020 | Nils Reimers and Iryna Gurevych
This paper presents a method for extending existing monolingual sentence embedding models to new languages using multilingual knowledge distillation. A pre-trained monolingual model (the teacher) is used to train a new model (the student) so that the student's embeddings for both a sentence and its translation are close to the teacher's embedding of the original sentence. This makes it possible to derive multilingual sentence embeddings from previously monolingual models with relatively few parallel samples per language. The approach is computationally efficient, has modest hardware requirements, and yields well-aligned vector spaces across languages.
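The training objective behind this setup can be sketched in a few lines. The following is a minimal illustration with mock embedding vectors, not the paper's actual implementation; the function name `distillation_loss` and the toy vectors are assumptions for demonstration only.

```python
import numpy as np

def distillation_loss(teacher_src, student_src, student_tgt):
    """Multilingual distillation objective (sketch): the student is pushed
    toward the teacher's embedding of the source sentence for BOTH the
    source sentence and its translation, which aligns the two languages
    in the teacher's vector space."""
    mse_src = np.mean((teacher_src - student_src) ** 2)
    mse_tgt = np.mean((teacher_src - student_tgt) ** 2)
    return mse_src + mse_tgt

# Toy 4-dimensional embeddings (invented values for illustration).
teacher = np.array([1.0, 0.0, 0.5, 0.5])      # teacher("Hello")
student_en = np.array([0.9, 0.1, 0.5, 0.4])   # student("Hello")
student_de = np.array([1.1, 0.0, 0.4, 0.5])   # student("Hallo")

loss = distillation_loss(teacher, student_en, student_de)
```

Minimizing this loss over many translation pairs simultaneously transfers the teacher's vector-space properties to the student and pulls the translations of a sentence onto the same point as the source sentence.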
The method is demonstrated on 50+ languages from various language families, and code is available for extending sentence embeddings to more than 400 languages. The approach outperforms existing methods such as LASER and mUSE, especially on low-resource languages. The student model learns two key properties: vector spaces are aligned across languages (a sentence and its translation map to nearby points), and the properties of the teacher's source-language vector space are transferred to the other languages.
The paper evaluates the effectiveness of the approach on three tasks: semantic textual similarity (STS), bitext retrieval, and cross-lingual similarity search. Results show that the multilingual knowledge distillation approach achieves state-of-the-art performance, particularly for cross-lingual tasks. It also performs well on low-resource languages, achieving high accuracy scores even when the target language is not part of the pre-training data.
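Because the resulting vector spaces are aligned, cross-lingual similarity search reduces to nearest-neighbor retrieval by cosine similarity. Below is a minimal sketch; the embedding values and sentence labels are invented for illustration and do not come from a real model.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search(query_vec, candidate_vecs):
    """Return the index of the candidate most similar to the query."""
    sims = [cosine_sim(query_vec, c) for c in candidate_vecs]
    return int(np.argmax(sims))

# Mock aligned embeddings: the German translation of the English query
# sits close to it in the shared space, an unrelated sentence does not.
query = np.array([0.9, 0.1, 0.4])        # "The weather is nice"
candidates = [
    np.array([0.1, 0.9, 0.2]),           # "Ich esse einen Apfel"
    np.array([0.88, 0.12, 0.41]),        # "Das Wetter ist schön"
]
best = search(query, candidates)  # → 1 (the German translation)
```

The same nearest-neighbor procedure underlies both bitext retrieval (finding translated sentence pairs in two corpora) and cross-lingual similarity search over a multilingual pool.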
The method is tested with various sources of training data, including parallel corpora and bilingual dictionaries; the results show that training-data quality significantly affects performance. The approach is also evaluated for language bias, i.e., whether a model systematically prefers candidates in certain languages over others regardless of content. The multilingual knowledge distillation approach exhibits minimal language bias, making it suitable for retrieval over mixed-language sentence pools.
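One crude way to quantify language bias is to check how often a query's nearest neighbor in a mixed-language pool is a wrong candidate in a different language than the gold answer. This is an illustrative metric sketched under our own assumptions, not the paper's exact evaluation protocol; all names and vectors below are hypothetical.

```python
import numpy as np

def language_bias_rate(queries, candidates, cand_langs, gold_idx):
    """Fraction of queries whose nearest neighbour is wrong AND lies in a
    different language than the gold candidate -- a rough proxy for
    language bias in a mixed-language candidate pool."""
    biased = 0
    for q, gold in zip(queries, gold_idx):
        # Cosine similarity of the query against every candidate.
        sims = candidates @ q / (
            np.linalg.norm(candidates, axis=1) * np.linalg.norm(q)
        )
        top = int(np.argmax(sims))
        if top != gold and cand_langs[top] != cand_langs[gold]:
            biased += 1
    return biased / len(queries)

# Toy pool: one query, a German gold candidate and a French distractor.
queries = np.array([[1.0, 0.0]])
candidates = np.array([[0.9, 0.1],    # gold (de)
                       [0.2, 0.9]])   # distractor (fr)
rate = language_bias_rate(queries, candidates, ["de", "fr"], [0])  # → 0.0
```

A well-aligned, unbiased model keeps this rate near zero: retrieval is driven by meaning, not by which language a candidate happens to be written in.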
The paper concludes that the proposed method is effective for creating multilingual sentence embeddings with aligned vector spaces, and it can be applied to a wide range of languages with minimal training data. The approach is efficient, scalable, and suitable for both high- and low-resource languages.