4 Sep 2020 | Junjie Hu * 1 Sebastian Ruder * 2 Aditya Siddhant 3 Graham Neubig 1 Orhan Firat 3 Melvin Johnson 3
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced to evaluate the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. XTREME focuses on zero-shot cross-lingual transfer, where models are trained on English data and evaluated on other languages. The benchmark covers a diverse set of tasks, including natural language inference, paraphrase identification, part-of-speech tagging, named entity recognition, question answering, and sentence retrieval. The evaluation reveals that while models achieve human-level performance on English tasks, there is a significant gap in performance on other languages, particularly on syntactic and sentence retrieval tasks. The benchmark also highlights differences in performance across language families, with Indo-European languages performing better than Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages. The paper provides strong baselines, an online platform, and detailed analysis to encourage research on cross-lingual learning methods.