11 Mar 2021 | Mihir Kale, Linting Xue*, Noah Constant*, Adam Roberts*, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
mT5 is a multilingual variant of the Text-to-Text Transfer Transformer (T5), pre-trained on a new Common Crawl-based dataset covering 101 languages. It was designed to achieve state-of-the-art performance on multilingual benchmarks and to address the problem of "accidental translation" in zero-shot settings. mT5 inherits the benefits of T5, including its text-to-text format, its design grounded in large-scale empirical studies, and its scale.

mT5 is trained on the mC4 dataset, which contains text in 101 languages. The dataset was built by running cld3 language identification over Common Crawl and applying filtering steps to ensure quality. The model architecture and training procedure closely follow T5, with improvements such as GeGLU nonlinearities and scaling both d_model and d_ff (a sketch of a GeGLU feed-forward block is given below). To balance training across high- and low-resource languages, examples are drawn with a language sampling strategy that boosts low-resource languages relative to their raw data counts (also sketched below).

mT5 outperforms existing multilingual models on a range of benchmarks, including XNLI, XQuAD, and TyDi QA. It also mitigates accidental translation by mixing unlabeled pre-training data into fine-tuning (a simple sketch of this mixing appears at the end). The models are publicly released, allowing the community to build on them. Overall, mT5 demonstrates strong performance across a wide range of multilingual tasks, highlighting the effectiveness of its design and training approach.
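The language-balancing scheme can be illustrated in a few lines. The sketch below assumes illustrative per-language example counts (not the real mC4 statistics) and uses an exponent alpha = 0.3, the value the mT5 paper reports for its sampling: each language is sampled with probability proportional to its example count raised to alpha.

```python
import numpy as np

# Illustrative per-language example counts; these are NOT the real mC4 statistics.
num_examples = {"en": 3_000_000_000, "ru": 700_000_000, "sw": 10_000_000, "yo": 1_000_000}

alpha = 0.3  # alpha < 1 flattens the distribution, boosting low-resource languages

counts = np.array(list(num_examples.values()), dtype=np.float64)
probs = counts ** alpha
probs /= probs.sum()

for lang, p_raw, p_smooth in zip(num_examples, counts / counts.sum(), probs):
    print(f"{lang}: raw share {p_raw:.4f} -> sampling probability {p_smooth:.4f}")
```

Running this shows that English's share drops well below its raw proportion of the corpus, while Swahili and Yoruba are sampled far more often than their raw counts alone would allow.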
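As for the GeGLU feed-forward block mentioned above, a minimal NumPy sketch is given here; the dimensions and random weights are illustrative placeholders, not the real mT5 hyperparameters.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_up, w_out):
    # Gated feed-forward block: GELU(x @ w_gate) is multiplied elementwise with
    # x @ w_up, then projected back to d_model; biases are omitted, as in T5-style blocks.
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_out

d_model, d_ff = 512, 1024  # illustrative sizes only
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
out = geglu_ffn(x,
                rng.normal(size=(d_model, d_ff)) * 0.02,
                rng.normal(size=(d_model, d_ff)) * 0.02,
                rng.normal(size=(d_ff, d_model)) * 0.02)
print(out.shape)  # (4, 512)
```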
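Finally, the accidental-translation mitigation amounts to interleaving a small fraction of the unsupervised multilingual pre-training task into the fine-tuning stream. A hedged sketch, with a hypothetical `mixed_batches` helper and an illustrative mixing ratio (not the value reported for mT5), might look like this:

```python
import random

def mixed_batches(finetune_batches, pretrain_batches, pretrain_ratio=0.01, seed=0):
    # Yield fine-tuning batches, occasionally interleaving a batch drawn from
    # the unsupervised multilingual pre-training task. The ratio is illustrative.
    rng = random.Random(seed)
    pretrain_iter = iter(pretrain_batches)
    for batch in finetune_batches:
        if rng.random() < pretrain_ratio:
            yield ("pretrain", next(pretrain_iter))
        yield ("finetune", batch)

# Toy usage with placeholder batch objects.
ft = [f"xnli_batch_{i}" for i in range(5)]
pt = (f"span_corruption_batch_{i}" for i in range(1000))
for task, batch in mixed_batches(ft, pt, pretrain_ratio=0.5):
    print(task, batch)
```

Keeping some multilingual span-corruption data in the mix during fine-tuning discourages the model from drifting toward English-only generation, which is the failure mode behind accidental translation in zero-shot transfer.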