UNSUPERVISED MACHINE TRANSLATION USING MONOLINGUAL CORPORA ONLY

13 Apr 2018 | Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc’Aurelio Ranzato
This paper presents an approach to machine translation that relies on monolingual corpora only, with no parallel sentences or other labeled data. The model maps sentences from the two languages into a shared latent space and learns to translate by reconstructing sentences in both languages from that space, using a sequence-to-sequence architecture with attention.

Training follows two principles: (i) the model must reconstruct a sentence in a given language from a noisy version of it, as in standard denoising auto-encoders, and (ii) it must reconstruct any source sentence from a noisy translation of that sentence in the target language, and vice versa. A sketch of a typical corruption function for the denoising objective appears below.
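The summary leaves the noise model unspecified; in the paper it combines word dropout with a local shuffle that moves no token more than a few positions. The sketch below is a minimal illustration of such a corruption function, with illustrative defaults for the drop probability p_drop and the shuffle window k.

```python
import random

def corrupt(tokens: list[str], p_drop: float = 0.1, k: int = 3) -> list[str]:
    """Noise model for denoising auto-encoding: drop each token with
    probability p_drop, then apply a local shuffle in which no token
    moves more than k positions from its original index."""
    kept = [t for t in tokens if random.random() > p_drop]
    # Jitter each index by a uniform offset in [0, k + 1) and sort: this
    # bounds every token's displacement by at most k positions.
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

random.seed(0)
print(corrupt("the cat sat on the mat".split()))
```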
To keep the latent representations of the two languages aligned, a discriminator is trained to predict which language an encoded sentence came from, while the encoder is trained adversarially to fool it. The whole procedure is iterative: training starts from a naive word-by-word translation model based on a bilingual lexicon derived from the monolingual data alone, and each round of denoising, cross-domain reconstruction, and adversarial training yields a more accurate model whose translations feed the next round.

Evaluated on multiple language pairs and datasets, and without using a single parallel sentence during training, the model reaches BLEU scores of 32.8 on Multi30k and 15.1 on the WMT English-French benchmark, performance comparable to that of supervised baselines trained with parallel sentences and attractive for low-resource languages where parallel data is scarce. It outperforms word-by-word translation and word-reordering baselines, and ablation studies confirm that both the adversarial training and the auto-encoding loss are essential components of the system. Sketches of the naive initialization and of the adversarial objective follow.
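As a concrete illustration of the naive initialization, the minimal sketch below translates word by word with a bilingual lexicon. In the paper the lexicon is inferred from the monolingual corpora themselves (for example by aligning monolingual word embeddings without supervision); the entries here are hand-written purely for illustration.

```python
def word_by_word_translate(sentence: str, lexicon: dict[str, str]) -> str:
    """Naive initial translation model: substitute each word via a bilingual
    lexicon; out-of-vocabulary words are copied through unchanged."""
    return " ".join(lexicon.get(word, word) for word in sentence.split())

# Hand-written English->French entries, purely illustrative; the paper's
# lexicon is derived from monolingual data, not written by hand.
lexicon = {"the": "le", "cat": "chat", "sleeps": "dort"}
print(word_by_word_translate("the cat sleeps", lexicon))  # -> "le chat dort"
```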
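Since the ablations single out the adversarial component, here is a minimal PyTorch sketch of a latent-space discriminator and the two losses involved; the network shape and latent dimension are illustrative stand-ins, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

D_LATENT = 256  # illustrative latent dimension

class LatentDiscriminator(nn.Module):
    """Predicts which language a latent sentence representation came from;
    the encoder is trained to fool it, aligning the two distributions."""

    def __init__(self, d_latent: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)  # one language logit per vector

bce = nn.BCEWithLogitsLoss()
disc = LatentDiscriminator(D_LATENT)

# Stand-ins for encoder outputs on a source-language and a target-language batch.
z_src, z_tgt = torch.randn(8, D_LATENT), torch.randn(8, D_LATENT)

# Discriminator step: learn to classify the language of each latent vector.
d_loss = bce(disc(z_src), torch.zeros(8)) + bce(disc(z_tgt), torch.ones(8))

# Encoder (adversarial) step: labels are flipped so the encoder is pushed
# to make the two latent distributions indistinguishable.
adv_loss = bce(disc(z_src), torch.ones(8)) + bce(disc(z_tgt), torch.zeros(8))
```

In the full system, this adversarial term is combined with the denoising and cross-domain reconstruction losses to form the overall training objective.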