May 22-27, 2022 | Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang
This paper explores the use of pre-trained language models to learn cross-lingual sentence embeddings, focusing on combining the best methods for monolingual and cross-lingual representations. The authors investigate dual-encoder models, masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. They demonstrate that pre-trained multilingual language models significantly reduce the amount of parallel training data required, achieving 83.7% bi-text retrieval accuracy over 112 languages on the Tatoeba dataset, outperforming previous state-of-the-art models. The model also performs competitively on monolingual transfer learning benchmarks. The authors release their best multilingual sentence embedding model, LaBSE, which supports over 109 languages. The paper includes detailed experiments and ablation studies to understand the impact of various components, such as pre-training, negative sampling strategies, vocabulary choice, data quality, and quantity.
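The additive margin softmax objective mentioned above can be illustrated with a short sketch. This is not the authors' implementation; it is a minimal NumPy version of a bidirectional translation-ranking loss in which, for each source sentence, the aligned target is the positive and the other in-batch targets are negatives, with a margin subtracted from the positive-pair similarity. The function name, the margin value, and the scale parameter are illustrative assumptions.

```python
import numpy as np

def additive_margin_ranking_loss(src, tgt, margin=0.3, scale=10.0):
    """Bidirectional in-batch translation-ranking loss with an additive margin.

    src, tgt: (batch, dim) arrays of sentence embeddings, where row i of
    src and row i of tgt are a translation pair. Illustrative sketch only;
    margin and scale values are assumptions, not the paper's settings.
    """
    # L2-normalize so dot products are cosine similarities
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                       # pairwise cosine similarity matrix
    n = sim.shape[0]
    # subtract the margin from the true-pair (diagonal) similarities only
    sim[np.arange(n), np.arange(n)] -= margin
    logits = scale * sim                    # temperature/scale before softmax

    def softmax_xent(l):
        # cross-entropy where the correct class for row i is column i
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # bidirectional: rank targets given sources, and sources given targets
    return 0.5 * (softmax_xent(logits) + softmax_xent(logits.T))
```

Because the margin is subtracted only from the positive pairs, the loss stays above zero even for perfectly aligned embeddings, pushing translation pairs to be separated from negatives by at least the margin.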