A New Massive Multilingual Dataset for High-Performance Language Technologies

The HPLT (High Performance Language Technologies) language resources are a new massive multilingual dataset containing both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. The monolingual collection focuses on low- to medium-resourced languages and covers 75 languages with approximately 5.6 trillion word tokens. The English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs with over 96 million aligned sentence pairs, which contain more than 1.4 billion English tokens. The dataset is one of the largest open text corpora ever released and provides resources for language modeling and machine translation training.

The release also includes synthetic datasets obtained by pivoting through English, covering 171 language pairs, along with 22 MT models for fast translation and bilingual document alignment and 9 new Bicleaner models for sentence pair scoring. The languages range from high-resource ones such as English, Chinese, and Russian to low-resource ones such as Esperanto and Pashto.

The dataset was created by processing large web crawls from the Internet Archive and CommonCrawl with open-source software tools and high-performance computing. The monolingual and bilingual text processing pipelines were developed separately: the monolingual text was processed with the Monotextor pipeline and the bilingual text with the Bitextor pipeline. Each document carries metadata such as the source URL, language, and scores.

The HPLT language resources, including the corpora, software, and tools used in this work, are released under the permissive CC0 license and are available for download and use by the research community.
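
The idea of building synthetic pairs by pivoting through English can be illustrated with a small sketch. This is not the HPLT implementation: the function name `pivot_through_english`, the exact-match join on the English sentence, and the toy Finnish and Spanish data are all assumptions made for illustration. The sketch only shows the general technique of joining two English-centric corpora on their shared English side.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

SentencePair = Tuple[str, str]


def pivot_through_english(
    english_to_x: Iterable[SentencePair],
    english_to_y: Iterable[SentencePair],
) -> List[SentencePair]:
    """Join two English-centric corpora on identical English sentences.

    Hypothetical helper: each input yields (english, translation) tuples.
    Returns de-duplicated (x, y) pairs whose English sides match exactly
    after stripping surrounding whitespace.
    """
    # Index the first corpus by its English side.
    by_english: Dict[str, List[str]] = defaultdict(list)
    for english, x_sentence in english_to_x:
        by_english[english.strip()].append(x_sentence)

    pairs: List[SentencePair] = []
    seen: Set[SentencePair] = set()
    # Look up every English sentence of the second corpus in the index.
    for english, y_sentence in english_to_y:
        for x_sentence in by_english.get(english.strip(), []):
            candidate = (x_sentence, y_sentence)
            if candidate not in seen:
                seen.add(candidate)
                pairs.append(candidate)
    return pairs


# Toy data (assumed for illustration, not taken from the dataset).
en_fi = [("Good morning.", "Hyvää huomenta."), ("Thank you.", "Kiitos.")]
en_es = [("Thank you.", "Gracias."), ("Good night.", "Buenas noches.")]

fi_es = pivot_through_english(en_fi, en_es)
print(fi_es)
```

Only the sentence that appears on the English side of both toy corpora produces a Finnish-Spanish pair. A real pipeline would likely need normalization and quality filtering beyond this exact-match join.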

20 Mar 2024 | Ona de Gibert¹, Graeme Nail², Nikolay Arefyev³, Marta Bañón⁴, Jelmer van der Linde², Shaoxiong Ji¹, Jaume Zaragoza-Bernabeu⁴, Mikko Aulamo¹, Gema Ramírez-Sánchez⁴, Andrey Kutuzov³, Sampo Pyysalo⁵, Stephan Oepen³ and Jörg Tiedemann¹