20 Mar 2024 | Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen and Jörg Tiedemann
The paper introduces the HPLT (High Performance Language Technologies) language resources, a massive multilingual dataset comprising both monolingual and bilingual corpora. The dataset is derived from web crawls provided by the Internet Archive and CommonCrawl, and covers 75 languages and approximately 5.6 trillion word tokens. The English-centric parallel corpus includes 18 language pairs with over 96 million aligned sentence pairs and roughly 1.4 billion English tokens. The HPLT resources are among the largest open text corpora and are intended for training language models and machine translation systems. The authors describe their methods for data acquisition, management, and processing, which rely on open-source software tools and high-performance computing. The datasets are released under a permissive CC0 license, along with the software and tools used in the project. The paper also discusses challenges and future directions in building large-scale multilingual datasets, emphasizing environmental considerations and the need for further research in language resource development.