MaLA-500: Massive Language Adaptation of Large Language Models

3 Apr 2024 | Peiqin Lin*,1,2, Shaoxiong Ji*3, Jörg Tiedemann3, André F. T. Martins4,5,6, Hinrich Schütze1,2
The paper introduces MaLA-500, a novel large language model (LLM) designed to cover 534 languages, addressing the limited effectiveness of existing LLMs for low-resource languages. MaLA-500 is trained by applying vocabulary extension and continued pretraining to LLaMA 2 using the Glot500-c corpus. Intrinsic evaluation shows that MaLA-500 outperforms existing multilingual LLMs at predicting text in low-resource languages. Extrinsically, MaLA-500 achieves significant improvements on the SIB200 and Taxi1500 benchmarks, outperforming previous LLMs by 11.68% and 4.82% in macro-average accuracy, respectively. The paper also discusses the challenges and methods involved in massive language adaptation, including vocabulary extension and continued pretraining, and provides a detailed analysis of the model's performance across languages and benchmarks. The work broadens the accessibility of LLMs to a diverse set of languages, particularly those underrepresented in existing models.
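The vocabulary-extension step mentioned above can be sketched as follows. This is a minimal, self-contained illustration of how an embedding matrix grows when new tokens are added before continued pretraining; the toy dimensions, the number of added tokens, and the mean-plus-noise initialization are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def extend_embeddings(emb: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Append n_new rows to an embedding matrix.

    New rows are initialized to the mean of the existing embeddings
    plus small Gaussian noise -- a common heuristic when extending an
    LLM's vocabulary; the paper does not specify this exact scheme.
    Existing rows are kept unchanged so pretrained knowledge is preserved.
    """
    rng = np.random.default_rng(seed)
    mean = emb.mean(axis=0, keepdims=True)
    new_rows = mean + rng.normal(scale=0.01, size=(n_new, emb.shape[1]))
    return np.concatenate([emb, new_rows], axis=0)

# LLaMA 2's original vocabulary has 32,000 tokens; the extended size
# and embedding width below are toy values for illustration only.
emb = np.random.default_rng(1).normal(size=(32000, 64))
extended = extend_embeddings(emb, n_new=8000)
print(extended.shape)  # (40000, 64)
```

In a real adaptation pipeline the same idea applies to the model's input embeddings and output head after the tokenizer is retrained or merged to cover the new languages; continued pretraining on the multilingual corpus then adjusts both old and new rows.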