July 5 - 10, 2020 | Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
This paper presents XLM-R, a multilingual masked language model trained on text in 100 languages using more than two terabytes of filtered CommonCrawl data. XLM-R significantly outperforms multilingual BERT (mBERT) on cross-lingual benchmarks, gaining +14.6% average accuracy on XNLI, +13% average F1 on MLQA, and +2.4% F1 on NER. It is particularly strong on low-resource languages, improving XNLI accuracy over previous XLM models by 15.7% for Swahili and 11.4% for Urdu, and it sets a new state of the art for cross-lingual classification, sequence labeling, and question answering.

The paper also presents a detailed analysis of the trade-offs at scale between positive transfer and capacity dilution, and between high- and low-resource languages. It shows that performance improves with a larger model, more training data, and a larger shared vocabulary, and it studies how language sampling and vocabulary size affect results. Multilingual tokenization is simplified by applying SentencePiece directly to raw text, without degrading performance. Finally, under monolingual fine-tuning on the GLUE and XNLI benchmarks, XLM-R is competitive with strong monolingual models, showing that multilingual modeling at this scale does not have to sacrifice per-language performance.
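To make the language-sampling discussion concrete, below is a minimal sketch of the exponentiated (temperature-based) sampling used to rebalance high- and low-resource languages during multilingual pretraining, q_i ∝ p_i^α, with α = 0.3 as reported in the paper. The per-language corpus sizes in the sketch are made-up placeholders, not the actual CC-100 statistics.

```python
import numpy as np

# Hypothetical per-language corpus sizes (GB of CommonCrawl text);
# these are placeholders for illustration, not the real CC-100 numbers.
corpus_sizes = {"en": 300.0, "ru": 278.0, "sw": 1.6, "ur": 5.7}

def sampling_probs(sizes, alpha=0.3):
    """Exponentiated sampling distribution q_i ∝ p_i^alpha, where p_i is the
    fraction of the corpus in language i. A smaller alpha up-samples
    low-resource languages relative to their raw share of the data."""
    langs = list(sizes)
    p = np.array([sizes[l] for l in langs], dtype=float)
    p /= p.sum()          # raw data distribution p_i
    q = p ** alpha
    q /= q.sum()          # rebalanced sampling distribution q_i
    return dict(zip(langs, q))

print(sampling_probs(corpus_sizes))              # alpha = 0.3
print(sampling_probs(corpus_sizes, alpha=1.0))   # alpha = 1 recovers raw proportions
```

With α = 1 the model simply sees each language in proportion to its raw size; lowering α shifts probability mass toward low-resource languages such as Swahili and Urdu, at the cost of seeing slightly less of the high-resource ones.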
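As a usage-level illustration of the pretrained masked language model, here is a minimal sketch that queries the released checkpoint through the Hugging Face `transformers` library; this assumes that library and its hosted `xlm-roberta-base` checkpoint, which are not part of the paper itself, and the example sentences are illustrative only.

```python
from transformers import pipeline

# Load the public XLM-R base checkpoint as a fill-mask (masked LM) pipeline.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# A single shared model and vocabulary handle the mask token across languages.
for text in ["Paris is the capital of <mask>.",
             "Paris est la capitale de la <mask>."]:
    print(text)
    for cand in fill_mask(text, top_k=3):
        print(f"  {cand['token_str']!r}  (score={cand['score']:.3f})")
```

The same checkpoint is what gets fine-tuned for the downstream experiments: in the zero-shot cross-lingual setting it is fine-tuned on English task data (e.g., English XNLI) and then evaluated directly on the other languages.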