A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

5 Jun 2024 | Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
A significant portion of the web is translated into many languages, and much of this multi-way parallel content is of low enough quality to suggest it was produced by machine translation (MT). The study finds that multi-way parallel translations dominate the content available in lower-resource languages, constituting a large fraction of the total web content in those languages. It also finds evidence of a selection bias: lower-quality English content is translated via MT into many lower-resource languages. Compared with the rest of the corpus, multi-way parallel data is shorter, simpler, and more likely to come from the CONVERSATION & OPINION topic, and it has higher LASER margin scores, which is consistent with it being MT-generated.

These findings raise concerns about training multilingual large language models on web-scraped data, since MT-generated content may yield less fluent models with more hallucinations, and they underscore the importance of data quality and of filtering noise from web-scraped corpora. To support the analysis, the researchers create the largest multi-way parallel corpus to date, comprising 6.4B unique sentences in 90 languages, and release code to reproduce both the corpus and the analysis. The study also discusses its limitations, including its focus on common languages and the challenges of analyzing data at the sentence level.
The findings have implications for the development of multilingual models and the use of web data in training.
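
To make the multi-way construction concrete, here is a minimal sketch of one way such tuples can be recovered from pairwise bitext: treat each aligned sentence pair as an edge between (language, sentence) nodes and take connected components with union-find. The sentence pairs and the 3-language threshold below are invented for illustration; the paper's actual pipeline and data differ.

```python
# Hypothetical sketch: recover multi-way parallel tuples from pairwise bitext
# by taking connected components over (language, sentence) nodes.
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Each bitext pair links a (language, sentence) node to its translation.
# These toy pairs are invented for illustration.
pairs = [
    (("en", "hello world"), ("fr", "bonjour le monde")),
    (("fr", "bonjour le monde"), ("de", "hallo welt")),
    (("en", "good morning"), ("es", "buenos días")),  # 2-way only, not multi-way
]

uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

groups = defaultdict(list)
for a, b in pairs:
    for node in (a, b):
        groups[uf.find(node)].append(node)

# Multi-way parallel tuples are components spanning 3+ languages.
for members in groups.values():
    langs = {lang for lang, _ in set(members)}
    if len(langs) >= 3:
        print(sorted(set(members)))
```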
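The LASER margin score mentioned above is, in the bitext-mining literature, the ratio margin of Artetxe and Schwenk (2019): the cosine similarity of a candidate pair divided by the average similarity of each side to its k nearest neighbors. Below is a minimal sketch over precomputed, L2-normalized sentence embeddings; the embedding matrices and k=4 are assumptions, and real pipelines score candidates against large mined pools rather than small aligned toy matrices.

```python
# Minimal sketch of ratio-margin scoring over precomputed sentence embeddings
# (e.g., LASER). Assumes src_emb and tgt_emb are L2-normalized (n, d) arrays
# whose rows are aligned candidate pairs; both are assumptions for this demo.
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Score each (src_i, tgt_i) candidate: cos(x, y) divided by the average
    cosine similarity of x and y to their k nearest neighbors on the other side."""
    sim = src_emb @ tgt_emb.T  # cosine similarities (rows pre-normalized)
    # Average similarity of each source sentence to its k most similar targets.
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Average similarity of each target sentence to its k most similar sources.
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    pair_sim = np.diag(sim)  # similarity of each aligned candidate pair
    return pair_sim / ((src_knn + tgt_knn) / 2.0)

# Toy usage with random unit vectors standing in for LASER embeddings.
rng = np.random.default_rng(0)
def normed(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

src = normed(rng.normal(size=(8, 1024)))
tgt = normed(rng.normal(size=(8, 1024)))
print(margin_scores(src, tgt, k=4))
```

A higher margin indicates a pair that is much more similar to each other than to other nearby sentences, which is why unusually high scores on multi-way parallel data are consistent with MT-generated (near-literal) translations.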