Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

25 Sep 2019 | Mikel Artetxe, Holger Schwenk
This paper introduces a system for learning joint multilingual sentence embeddings for 93 languages, belonging to more than 30 language families and written in 28 different scripts. The system uses a single BiLSTM encoder with a BPE vocabulary shared across all languages, trained on publicly available parallel corpora and coupled with an auxiliary decoder. Because all languages are mapped into the same embedding space, a classifier trained on top of the embeddings using annotated data in one language (e.g., English) can be transferred to any of the 93 languages without modification. The implementation is available at https://github.com/facebookresearch/LASER.

The motivation is that deep learning in NLP typically requires large amounts of labeled data, which limits its applicability in many practical scenarios and for most languages. The proposed approach addresses this by learning universal, language-agnostic sentence embeddings that are general with respect to both the input language and the NLP task: a single encoder handles all languages, so semantically similar sentences end up close in the embedding space regardless of the language they are written in. Previous work in multilingual NLP has mostly focused on a few languages or on specific applications, and cross-lingual word embeddings, a popular alternative, are often learned with weak or no cross-lingual signal and operate at the word rather than the sentence level.

Architecturally, the system is a sequence-to-sequence encoder-decoder trained end-to-end on parallel corpora. After training, the decoder is discarded and the encoder is used to embed sentences in any of the training languages. The embeddings are evaluated on cross-lingual natural language inference (XNLI), cross-lingual document classification (MLDoc), and parallel corpus mining (BUCC), showing strong performance. In addition, the paper introduces a new test set of aligned sentences in 112 languages, based on the Tatoeba corpus, and uses it to show that the embeddings remain effective even for low-resource languages.
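For concreteness, here is a minimal PyTorch sketch of the kind of encoder described above: token embeddings over a shared BPE vocabulary, a BiLSTM, and max-pooling over the outputs to obtain a fixed-size sentence embedding (the max-pooling strategy follows the paper; the vocabulary size, depth, and dimensions below are illustrative defaults, not the exact released configuration).

```python
# Minimal sketch of a LASER-style encoder: shared BPE embeddings, a BiLSTM,
# and max-pooling over time to produce a fixed-size sentence embedding.
# Dimensions and vocabulary size are illustrative, not the released config.
import torch
import torch.nn as nn


class BiLSTMSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=320, hidden_dim=512,
                 num_layers=5, pad_idx=1):
        super().__init__()
        # One embedding table over a joint BPE vocabulary for all languages.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.pad_idx = pad_idx
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) BPE ids, padded with pad_idx.
        mask = token_ids.ne(self.pad_idx).unsqueeze(-1)        # (batch, seq, 1)
        outputs, _ = self.lstm(self.embed(token_ids))          # (batch, seq, 2*hidden)
        # Max-pool over time, ignoring padding positions.
        outputs = outputs.masked_fill(~mask, float("-inf"))
        return outputs.max(dim=1).values                       # (batch, 2*hidden)


# Example: embed a batch of two already BPE-encoded sentences.
encoder = BiLSTMSentenceEncoder()
batch = torch.randint(2, 50000, (2, 12))
print(encoder(batch).shape)  # torch.Size([2, 1024])
```

During training, this embedding would condition an auxiliary decoder that reconstructs the translation; at inference time only the encoder is kept, as described above.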
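The zero-shot transfer recipe is simple once the embedding space is shared: train a classifier on embeddings of annotated sentences in one language, then apply it unchanged to embeddings of any other language. The sketch below uses random arrays as stand-ins for precomputed sentence embeddings and a scikit-learn logistic regression rather than the paper's exact classifier head; it is closer to the single-sentence document-classification setting (MLDoc) than to XNLI, which combines premise and hypothesis embeddings.

```python
# Hedged sketch of zero-shot cross-lingual transfer: fit a classifier on
# English embeddings only, then apply it directly to another language.
# X_en, y_en, X_de are stand-ins for real precomputed sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n_classes = 1024, 3

# Stand-ins for English training embeddings/labels and German test embeddings.
X_en = rng.normal(size=(1000, dim)).astype(np.float32)
y_en = rng.integers(0, n_classes, size=1000)
X_de = rng.normal(size=(200, dim)).astype(np.float32)

clf = LogisticRegression(max_iter=1000).fit(X_en, y_en)

# Because all languages share one embedding space, the English-trained
# classifier can be applied to German (or any of the 93 languages) as-is.
print(clf.predict(X_de)[:10])
```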
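For parallel corpus mining, candidate sentence pairs can be scored directly in the shared embedding space. The sketch below implements a simplified ratio-margin score (the cosine similarity of a pair divided by the average similarity to each sentence's nearest neighbors), the kind of margin-based criterion used for BUCC-style mining; the brute-force search, helper names, and threshold are illustrative, and a real system would use an approximate nearest-neighbor index instead.

```python
# Simplified sketch of margin-based parallel sentence mining: pair each source
# sentence with its best target candidate if the ratio-margin score passes a
# threshold. Brute-force cosine similarity stands in for an ANN index.
import numpy as np


def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.05):
    src, tgt = normalize(src_emb), normalize(tgt_emb)
    sim = src @ tgt.T                                      # cosine sims (n_src, n_tgt)

    # Average similarity to the k nearest neighbors in each direction.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # (n_tgt,)

    # Ratio margin: cos(x, y) divided by the mean of the two neighborhood averages.
    margin = sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)

    pairs = []
    for i in range(sim.shape[0]):
        j = int(margin[i].argmax())
        if margin[i, j] >= threshold:
            pairs.append((i, j, float(margin[i, j])))
    return pairs


# Toy usage with random vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
print(mine_pairs(rng.normal(size=(5, 1024)), rng.normal(size=(8, 1024))))
```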