Poro 34B and the Blessing of Multilinguality

24 Apr 2024 | Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Arne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo
The paper introduces Poro 34B, a 34-billion-parameter language model trained on 1 trillion tokens of Finnish, English, and programming languages. The model is designed to address the challenge of limited training data for smaller languages by leveraging multilingual training. The authors argue that multilinguality can be a blessing rather than a curse, as it can significantly enhance the capabilities of models for small languages. Poro 34B is trained with a custom tokenizer and a distinctive pretraining setup that includes cross-lingual data and instruction-formatted translation examples. The model is evaluated on a range of tasks, including Finnish language generation, English language generation, and code generation, demonstrating performance superior to that of existing models. The paper also discusses the limitations of certain benchmarks and the importance of open science and transparency in releasing the model and its associated resources.
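To make the pretraining setup more concrete, the sketch below shows how a sentence pair could be rendered as an instruction-formatted translation example for inclusion in the training data. The template wording and the `render_translation_example` helper are illustrative assumptions, not the exact format used for Poro 34B.

```python
# A minimal sketch of rendering a cross-lingual sentence pair as an
# instruction-formatted pretraining example. The template text below is an
# assumption for illustration; the actual Poro 34B template may differ.

TEMPLATE = "<|user|>Translate into Finnish: {src}<|assistant|>{tgt}"


def render_translation_example(src: str, tgt: str) -> str:
    """Format an English-Finnish sentence pair as one training document."""
    return TEMPLATE.format(src=src, tgt=tgt)


if __name__ == "__main__":
    print(render_translation_example(
        "The northern lights were visible last night.",
        "Revontulet näkyivät viime yönä.",
    ))
```

Embedding translation pairs this way lets the model see explicit cross-lingual signal during pretraining, rather than relying solely on incidental alignment between the monolingual corpora.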