Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

25 Sep 2019 | Mikel Artetxe, Holger Schwenk
This paper introduces a system for learning joint multilingual sentence embeddings for 93 languages, belonging to more than 30 language families and written in 28 different scripts. The system uses a single BiLSTM encoder with a BPE vocabulary shared across all languages, trained on publicly available parallel corpora and coupled with an auxiliary decoder. Because all languages are mapped into the same embedding space, a classifier trained on top of the embeddings using annotated data in one language (e.g., English) can be transferred to any of the 93 languages without modification. The implementation is available at https://github.com/facebookresearch/LASER.

The motivation is that deep learning in NLP typically requires large amounts of labeled data, which limits its applicability in many practical scenarios and for most languages. The proposed approach addresses this by learning universal, language-agnostic sentence embeddings that are general with respect to both the input language and the NLP task: a single encoder handles all languages, so semantically similar sentences end up close in the embedding space regardless of the language they are written in. Previous work in multilingual NLP has mostly focused on a few languages or on specific applications, and cross-lingual word embeddings, a popular alternative, are often learned with weak or no cross-lingual signal and operate at the word rather than the sentence level.

Architecturally, the system is a sequence-to-sequence encoder-decoder trained end-to-end on parallel corpora. After training, the decoder is discarded and the encoder is used to embed sentences in any of the training languages. The embeddings are evaluated on cross-lingual natural language inference (XNLI), cross-lingual document classification (MLDoc), and parallel corpus mining (BUCC), showing strong performance. In addition, the paper introduces a new test set of aligned sentences in 112 languages, based on the Tatoeba corpus, and uses it to show that the embeddings remain effective even for low-resource languages.
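For concreteness, here is a minimal PyTorch sketch of the kind of encoder described above: token embeddings over a shared BPE vocabulary, a BiLSTM, and max-pooling over the outputs to obtain a fixed-size sentence embedding (the max-pooling strategy follows the paper; the vocabulary size, depth, and dimensions below are illustrative defaults, not the exact released configuration).

```python
# Minimal sketch of a LASER-style encoder: shared BPE embeddings, a BiLSTM,
# and max-pooling over time to produce a fixed-size sentence embedding.
# Dimensions and vocabulary size are illustrative, not the released config.
import torch
import torch.nn as nn


class BiLSTMSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=320, hidden_dim=512,
                 num_layers=5, pad_idx=1):
        super().__init__()
        # One embedding table over a joint BPE vocabulary for all languages.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.pad_idx = pad_idx
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) BPE ids, padded with pad_idx.
        mask = token_ids.ne(self.pad_idx).unsqueeze(-1)        # (batch, seq, 1)
        outputs, _ = self.lstm(self.embed(token_ids))          # (batch, seq, 2*hidden)
        # Max-pool over time, ignoring padding positions.
        outputs = outputs.masked_fill(~mask, float("-inf"))
        return outputs.max(dim=1).values                       # (batch, 2*hidden)


# Example: embed a batch of two already BPE-encoded sentences.
encoder = BiLSTMSentenceEncoder()
batch = torch.randint(2, 50000, (2, 12))
print(encoder(batch).shape)  # torch.Size([2, 1024])
```

During training, this embedding would condition an auxiliary decoder that reconstructs the translation; at inference time only the encoder is kept, as described above.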
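The zero-shot transfer recipe is simple once the embedding space is shared: train a classifier on embeddings of annotated sentences in one language, then apply it unchanged to embeddings of any other language. The sketch below uses random arrays as stand-ins for precomputed sentence embeddings and a scikit-learn logistic regression rather than the paper's exact classifier head; it is closer to the single-sentence document-classification setting (MLDoc) than to XNLI, which combines premise and hypothesis embeddings.

```python
# Hedged sketch of zero-shot cross-lingual transfer: fit a classifier on
# English embeddings only, then apply it directly to another language.
# X_en, y_en, X_de are stand-ins for real precomputed sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n_classes = 1024, 3

# Stand-ins for English training embeddings/labels and German test embeddings.
X_en = rng.normal(size=(1000, dim)).astype(np.float32)
y_en = rng.integers(0, n_classes, size=1000)
X_de = rng.normal(size=(200, dim)).astype(np.float32)

clf = LogisticRegression(max_iter=1000).fit(X_en, y_en)

# Because all languages share one embedding space, the English-trained
# classifier can be applied to German (or any of the 93 languages) as-is.
print(clf.predict(X_de)[:10])
```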
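For parallel corpus mining, candidate sentence pairs can be scored directly in the shared embedding space. The sketch below implements a simplified ratio-margin score (the cosine similarity of a pair divided by the average similarity to each sentence's nearest neighbors), the kind of margin-based criterion used for BUCC-style mining; the brute-force search, helper names, and threshold are illustrative, and a real system would use an approximate nearest-neighbor index instead.

```python
# Simplified sketch of margin-based parallel sentence mining: pair each source
# sentence with its best target candidate if the ratio-margin score passes a
# threshold. Brute-force cosine similarity stands in for an ANN index.
import numpy as np


def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.05):
    src, tgt = normalize(src_emb), normalize(tgt_emb)
    sim = src @ tgt.T                                      # cosine sims (n_src, n_tgt)

    # Average similarity to the k nearest neighbors in each direction.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # (n_tgt,)

    # Ratio margin: cos(x, y) divided by the mean of the two neighborhood averages.
    margin = sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)

    pairs = []
    for i in range(sim.shape[0]):
        j = int(margin[i].argmax())
        if margin[i, j] >= threshold:
            pairs.append((i, j, float(margin[i, j])))
    return pairs


# Toy usage with random vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
print(mine_pairs(rng.normal(size=(5, 1024)), rng.normal(size=(8, 1024))))
```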