Language-agnostic BERT Sentence Embedding


May 22-27, 2022 | Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang
This paper introduces LaBSE, a language-agnostic BERT sentence embedding model that supports 109 languages. LaBSE combines pre-trained language models with dual-encoder fine-tuning on a translation ranking task, and it is trained on a large corpus of monolingual and bilingual data. The resulting embeddings achieve state-of-the-art performance on bi-text retrieval and mining tasks, including for low-resource languages: LaBSE reaches 83.7% average bi-text retrieval accuracy on Tatoeba, well above the previous state of the art. It also performs competitively on monolingual transfer learning benchmarks and semantic similarity tasks. The trained model is publicly available at https://tfhub.dev/google/LaBSE.

The paper further studies the effect of pre-training, negative sampling strategies, and vocabulary choice. Pre-training significantly improves translation ranking performance and reduces the amount of parallel data required. On the BUCC bi-text mining shared task, LaBSE outperforms previous models, and on downstream classification it achieves competitive results on the SentEval benchmark. The authors conclude that LaBSE is a strong model for cross-lingual tasks and that further work is needed to improve performance on low-resource languages.
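The translation-ranking objective behind the dual-encoder fine-tuning can be thought of as a bidirectional in-batch softmax with an additive margin on the true translation pair. The snippet below is a minimal sketch of that idea, not the authors' implementation; the margin value and tensor shapes are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def translation_ranking_loss(src_emb, tgt_emb, margin=0.3):
    """Bidirectional in-batch translation ranking loss with additive margin.

    src_emb, tgt_emb: [batch, dim] embeddings of aligned sentence pairs
    (row i of src_emb is a translation of row i of tgt_emb). The margin
    value is illustrative, not the paper's exact hyperparameter.
    """
    # L2-normalize so dot products are cosine similarities.
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)

    # Similarity of every source to every target in the batch; the
    # off-diagonal entries act as in-batch negatives.
    sim = src @ tgt.t()                                  # [batch, batch]

    # Additive margin: subtract `margin` from the true-pair scores so the
    # model must rank them above the negatives by at least that margin.
    eye = torch.eye(sim.size(0), device=sim.device)
    sim = sim - margin * eye

    # Rank targets given each source, and sources given each target.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)
```

In the paper the source and target sides are encoded by a single shared multilingual BERT encoder, and negatives are drawn from a larger pool than one device's batch (the negative sampling strategies the summary mentions); the sketch above shows only the simple in-batch case.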
Understanding Language-agnostic BERT Sentence Embedding