No Language Left Behind: Scaling Human-Centered Machine Translation


Authors: NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Pranthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang
The paper "No Language Left Behind: Scaling Human-Centered Machine Translation" addresses the challenge of expanding machine translation to support a broader range of languages, particularly those with limited resources. The authors, from Meta AI and UC Berkeley, aim to break the 200-language barrier while ensuring high-quality and safe translations. They begin by contextualizing the need for low-resource language translation through interviews with native speakers, emphasizing the importance of community needs and ethical considerations. The team then creates datasets and models to bridge the performance gap between low and high-resource languages. Specifically, they develop a conditional compute model based on Sparsely Gated Mixture of Experts, trained using novel data mining techniques tailored for low-resource languages. The model is evaluated using a human-translated benchmark, FLORES-200, and a toxicity benchmark, achieving a 44% improvement in BLEU score over the previous state-of-the-art. The paper also discusses the technical challenges, such as data collection and model training, and the social impact of their work, including its potential to expand information access and promote cultural preservation. The team opens-source all contributions, including datasets, scripts, and models, to support further research and practical applications.The paper "No Language Left Behind: Scaling Human-Centered Machine Translation" addresses the challenge of expanding machine translation to support a broader range of languages, particularly those with limited resources. The authors, from Meta AI and UC Berkeley, aim to break the 200-language barrier while ensuring high-quality and safe translations. They begin by contextualizing the need for low-resource language translation through interviews with native speakers, emphasizing the importance of community needs and ethical considerations. The team then creates datasets and models to bridge the performance gap between low and high-resource languages. Specifically, they develop a conditional compute model based on Sparsely Gated Mixture of Experts, trained using novel data mining techniques tailored for low-resource languages. The model is evaluated using a human-translated benchmark, FLORES-200, and a toxicity benchmark, achieving a 44% improvement in BLEU score over the previous state-of-the-art. The paper also discusses the technical challenges, such as data collection and model training, and the social impact of their work, including its potential to expand information access and promote cultural preservation. The team opens-source all contributions, including datasets, scripts, and models, to support further research and practical applications.