23 Jan 2020 | Yinhan Liu*, Jiatao Gu*, Naman Goyal*, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer
This paper introduces *mBART*, a multilingual sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in multiple languages using the BART objective. mBART is the first method to pre-train a complete sequence-to-sequence model by denoising full texts in multiple languages, rather than focusing on only the encoder, decoder, or reconstructing parts of the text. The pre-trained model can be directly fine-tuned for both supervised and unsupervised machine translation tasks without task-specific modifications. Extensive experiments demonstrate that mBART initialization leads to significant performance gains in various machine translation benchmarks, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. mBART also enables new types of transfer learning to language pairs with no bi-text or that were not in the pre-training corpus, suggesting that the initialization is at least partially language universal. The paper provides a detailed analysis of which factors contribute the most to effective pre-training, including the number of languages and their overall similarity.
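To make the "denoising full texts" objective concrete: BART-style denoising corrupts a document (here, by shuffling its sentences and masking word spans) and trains the seq2seq model to reconstruct the original text. The sketch below is a minimal illustration of that idea, not the paper's implementation; the 35% masking ratio, the Poisson(3.5) span lengths, the `<mask>` symbol, and the `en_XX`-style language tag are assumptions made for the example.

```python
import random
import numpy as np

MASK = "<mask>"

def noise_document(sentences, lang_id, mask_ratio=0.35, poisson_lambda=3.5):
    """Build one (noised source, original target) pre-training pair.

    sentences: list of token lists, one per sentence of the document.
    lang_id:   language tag appended to source and target, e.g. "en_XX".
    """
    # 1) Sentence permutation: shuffle the sentence order within the document.
    shuffled = list(sentences)
    random.shuffle(shuffled)
    tokens = [tok for sent in shuffled for tok in sent]

    # 2) Span masking ("text infilling"): replace word spans whose lengths are
    #    drawn from Poisson(poisson_lambda) with a single <mask> token, until
    #    roughly mask_ratio of the words have been removed.
    budget = int(mask_ratio * len(tokens))
    noised, i, removed = [], 0, 0
    while i < len(tokens):
        if removed < budget and random.random() < 0.5:
            span = max(1, int(np.random.poisson(poisson_lambda)))
            span = min(span, len(tokens) - i)
            noised.append(MASK)          # the whole span collapses to one mask
            i += span
            removed += span
        else:
            noised.append(tokens[i])
            i += 1

    # 3) Language-id token: tag both sides so one model can be pre-trained on
    #    many languages and later told which language to generate.
    original = [tok for sent in sentences for tok in sent]
    return noised + [lang_id], original + [lang_id]

# Example: one two-sentence English "document".
src, tgt = noise_document([["The", "cat", "sat", "."], ["It", "purred", "."]], "en_XX")
```

Because the decoder always reconstructs a complete, well-ordered document from a corrupted one, the same pre-trained weights can be fine-tuned directly on sentence-level or document-level translation pairs without architectural changes, which is what enables the "no task-specific modifications" claim above.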