Understanding A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

The paper investigates the prevalence and quality of multi-way parallel content on the web, that is, sentences that appear translated into many languages. It finds that a large share of web content in lower-resource languages is multi-way parallel, which indicates that these translations were likely produced by machine translation (MT). Multi-way parallel translations are generally of lower quality than translations found in only a single language pair, suggesting that they are less fluent and more prone to hallucinations. The study also reveals a selection bias in the kind of content that gets translated into many languages: it tends to be shorter, more predictable, and disproportionately from the CONVERSATION & OPINION topic. This bias is attributed to low-quality English content being translated en masse into lower-resource languages to generate ad revenue. The findings raise concerns about training multilingual large language models (LLMs) on MT-generated web data and highlight the need for better data quality filtering and MT detection techniques. The analysis is based on MWccMatrix, a large multi-way parallel corpus of 6.4 billion unique sentences in 90 languages.
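To make the notion of multi-way parallelism concrete, the sketch below groups sentence pairs into translation tuples and counts how many languages each tuple spans. It is an illustration under stated assumptions, not the paper's pipeline: the union-find grouping, the function names (`group_translation_tuples`, `parallelism`, `multiway_share`), the threshold of three languages, and the toy sentence pairs are all hypothetical.

```python
from collections import defaultdict


def group_translation_tuples(pairs):
    """Group (lang, sentence) nodes that are linked by sentence pairs.

    Assumption: two sentences belong to the same tuple if they are
    connected through any chain of pairs (union-find over the pairs).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def parallelism(group):
    """Number of distinct languages covered by one translation tuple."""
    return len({lang for lang, _ in group})


def multiway_share(groups, min_langs=3):
    """Fraction of sentences that sit in tuples spanning >= min_langs languages."""
    total = sum(len(g) for g in groups)
    multiway = sum(len(g) for g in groups if parallelism(g) >= min_langs)
    return multiway / total if total else 0.0


# Toy data (hypothetical): one short phrase seen in four languages,
# one longer sentence seen in a single language pair.
pairs = [
    (("en", "Thank you."), ("fr", "Merci.")),
    (("en", "Thank you."), ("sw", "Asante.")),
    (("fr", "Merci."), ("de", "Danke.")),
    (("en", "The committee adjourned at noon."),
     ("de", "Der Ausschuss vertagte sich mittags.")),
]

tuples = group_translation_tuples(pairs)
sizes = sorted(parallelism(g) for g in tuples)
print(sizes)  # [2, 4]
print(multiway_share(tuples))
```

In this toy input, four of the six sentences fall into a tuple spanning three or more languages. On a real corpus, the same kind of count is what would show the share of multi-way parallel content in a given language.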

A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

5 Jun 2024 | Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico