MaLA-500: Massive Language Adaptation of Large Language Models

3 Apr 2024 | Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze
MaLA-500 is a large language model designed to cover 534 languages, addressing the limited effectiveness of existing LLMs on low-resource languages. It is built by extending the vocabulary of LLaMA 2 with a multilingual tokenizer and continuing pretraining on the Glot500-c corpus, which covers 534 languages. Continued pretraining uses LoRA, which keeps training efficient by updating only low-rank adapter weights on top of the frozen base model.

In intrinsic evaluation, MaLA-500 predicts text in low-resource languages better than existing multilingual LLMs. In extrinsic evaluation on the SIB200 and Taxi1500 benchmarks, it achieves accuracy gains of 11.68% and 4.82%, respectively. The model is released on HuggingFace.

The study highlights the effectiveness of vocabulary extension and continued pretraining for enhancing multilingual capability, and comparisons with other LLMs show that MaLA-500 outperforms them on multilingual tasks. Training was carried out in a carbon-neutral data center to reduce environmental impact. The work expands the accessibility of LLMs to diverse languages, especially low-resource ones, and helps address language barriers. It also reviews related work on multilingual language models and language adaptation, emphasizing the need for further research on massive language adaptation for diverse languages.
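The adaptation recipe described above (vocabulary extension followed by LoRA-based continued pretraining) can be sketched with the HuggingFace transformers and peft libraries. This is a minimal illustration under stated assumptions, not the authors' training code: the tokenizer path, LoRA hyperparameters, and the choice of modules to save are assumed values.

```python
# Minimal sketch: vocabulary extension + LoRA continued pretraining on LLaMA 2.
# Paths, hyperparameters, and data handling are illustrative assumptions,
# not the authors' actual training setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"                # LLaMA 2 base model
EXTENDED_TOKENIZER = "path/to/multilingual-tokenizer"  # hypothetical tokenizer trained on Glot500-c

# 1. Load the base model and the extended multilingual tokenizer.
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tokenizer = AutoTokenizer.from_pretrained(EXTENDED_TOKENIZER)

# 2. Vocabulary extension: grow the embedding and LM-head matrices so the new
#    multilingual tokens get their own rows; existing rows are preserved.
model.resize_token_embeddings(len(tokenizer))

# 3. Continued pretraining with LoRA: freeze the base weights and train only
#    low-rank adapters. The new embedding/head rows also need updating, so they
#    are kept trainable via modules_to_save (assumed configuration).
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters train

# 4. From here, continued pretraining proceeds with a standard causal-LM
#    training loop (e.g. transformers.Trainer) over the Glot500-c corpus.
```

The key design point is that only the adapter weights and the newly added embedding rows are updated, which keeps the memory and compute cost of adapting to 534 languages far below that of full-parameter continued pretraining.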