Latxa: An Open Language Model and Evaluation Suite for Basque

29 Mar 2024 | Julen Etxaniz*, Oscar Sainz*, Naiara Perez*, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa
Latxa is an open-source family of large language models (LLMs) for Basque, ranging from 7 to 70 billion parameters. These models are based on Llama 2 and trained on a new Basque corpus comprising 4.3 million documents and 4.2 billion tokens. To address the scarcity of high-quality benchmarks for Basque, four new multiple-choice evaluation datasets are introduced: EusProficiency, EusReading, EusTrivia, and EusExams. Latxa outperforms all previous open models and GPT-3.5 Turbo in most tasks, with the 70B model outperforming the previous best open model by 18.89 points. The models are publicly available under open licenses, facilitating reproducible research on methods for building LLMs for low-resource languages. The paper also discusses the impact of model scale and the general vs. language-specific knowledge of LLMs, suggesting that continued pretraining with stronger English-centric models could lead to better Basque models.