29 Mar 2024 | Julen Etxaniz, Oscar Sainz, Naiara Perez, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa
Latxa is an open-source family of large language models (LLMs) for Basque, ranging from 7 to 70 billion parameters. It is based on Llama 2, with continued pretraining on a new Basque corpus of 4.3 million documents and 4.2 billion tokens. To address the lack of high-quality benchmarks for Basque, four multiple-choice evaluation datasets are introduced: EusProficiency (language proficiency), EusReading (reading comprehension), EusTrivia (trivia questions), and EusExams (public examinations). Together, these datasets contain 23,282 questions covering various aspects of Basque proficiency and knowledge.
Latxa outperforms all previous open models by a large margin and is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. The Latxa family of models, along with the new pretraining corpus and evaluation datasets, is publicly available under open licenses, enabling reproducible research on methods to build LLMs for low-resource languages.
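Since the checkpoints are openly released, they can be used like any other causal LM. Below is a minimal sketch of loading a Latxa model with Hugging Face Transformers; the repository id `HiTZ/latxa-7b-v1` is an assumption, so check the official release page for the exact model names.

```python
# Minimal sketch: load a Latxa checkpoint and generate text.
# The repo id below is an assumption, not confirmed by this article.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/latxa-7b-v1"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Euskal Herriko hiriburua"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```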
The training corpus combines various existing datasets, including EusCrawl v1.1, Egunkaria, Booktegi, Wikipedia, CulturaX, Colossal OSCAR, and HPLT v1. The data is carefully deduplicated and filtered to ensure quality. The preprocessed corpus consists of 1.22 billion words and 4.17 billion Llama 2 tokens.
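To make the deduplication and filtering step concrete, here is an illustrative sketch of hash-based exact deduplication plus a simple length filter. This is not the actual Latxa preprocessing pipeline, just the general idea behind document-level cleanup of a web-heavy corpus.

```python
# Illustrative document-level deduplication and quality filtering.
# NOT the exact Latxa pipeline; shows the general technique only.
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def dedup_and_filter(docs, min_words=20):
    seen = set()
    for doc in docs:
        text = normalize(doc["text"])
        if len(text.split()) < min_words:      # drop very short documents
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                     # skip exact duplicates
            continue
        seen.add(digest)
        yield doc

corpus = [{"text": "Kaixo mundua! " * 30}, {"text": "Kaixo mundua! " * 30}]
print(len(list(dedup_and_filter(corpus))))     # -> 1
```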
Latxa models are trained using the GPT-NeoX library on the CINECA HPC Leonardo computing cluster. The 7B, 13B, and 70B models are trained for 10,000 steps with a sequence length of 4,096 tokens and an effective batch size of 1 million tokens, for a total of roughly 10 billion training tokens. Training uses a cosine learning rate schedule with a 500-step warm-up and a decay to 3% of the peak learning rate.
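The schedule described above can be written out explicitly. The sketch below uses the numbers from the text (10,000 steps, 500 warm-up steps, floor at 3% of peak); the function is illustrative, not the GPT-NeoX implementation.

```python
# Cosine learning-rate schedule with linear warm-up and a 3%-of-peak floor,
# matching the hyperparameters described in the text (illustrative only).
import math

def lr_at(step, peak_lr, total_steps=10_000, warmup_steps=500, min_ratio=0.03):
    if step < warmup_steps:                       # linear warm-up
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    min_lr = peak_lr * min_ratio                  # floor at 3% of peak
    return min_lr + (peak_lr - min_lr) * cosine

print(lr_at(500, 1e-4))     # peak learning rate
print(lr_at(10_000, 1e-4))  # ~3% of peak
```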
The Latxa models are evaluated on a variety of tasks, including Belebele, XStoryCloze, and BasqueGLUE, in addition to the four new datasets. Latxa outperforms previous open models and GPT-3.5 Turbo, but lags behind GPT-4 Turbo on most benchmarks. However, Latxa 70B outperforms GPT-4 Turbo on EusProficiency and on the Language & Literature subset of EusTrivia, suggesting that an LLM's knowledge-intensive capabilities in a given language are not determined by its linguistic competence in that language.
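Multiple-choice benchmarks like these are typically scored by computing the log-likelihood a causal LM assigns to each candidate answer given the question and picking the highest-scoring option. The sketch below mirrors that common approach (as used by tools like the LM Evaluation Harness), but it is not the exact evaluation code; the model id is again an assumed placeholder.

```python
# Sketch of log-likelihood scoring for multiple-choice evaluation.
# Illustrative only; the repo id is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/latxa-7b-v1"  # assumed repository id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

@torch.no_grad()
def choice_logprob(question: str, answer: str) -> float:
    """Sum of token log-probabilities of `answer` conditioned on `question`."""
    q_ids = tok(question, return_tensors="pt").input_ids.to(model.device)
    a_ids = tok(answer, add_special_tokens=False,
                return_tensors="pt").input_ids.to(model.device)
    ids = torch.cat([q_ids, a_ids], dim=1)
    logits = model(ids).logits[:, :-1]             # predictions for next tokens
    logprobs = torch.log_softmax(logits, dim=-1)
    target = ids[:, 1:]
    scores = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return scores[:, -a_ids.shape[1]:].sum().item()  # score answer tokens only

def predict(question: str, choices: list[str]) -> int:
    """Return the index of the highest-likelihood choice."""
    return max(range(len(choices)),
               key=lambda i: choice_logprob(question, " " + choices[i]))
```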
The Latxa models are also evaluated on classical NLP tasks such as topic classification and coreference detection. Here the best results are obtained by specialized encoder-only models, showing that the traditional pretraining/fine-tuning paradigm with BERT-style models is still competitive for these classical discriminative tasks.
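For reference, the pretraining/fine-tuning paradigm mentioned above looks roughly like the following sketch: fine-tuning an encoder-only model for topic classification with the Hugging Face Trainer. The checkpoint, dataset files, and label count are placeholders, not the exact setup used in the paper.

```python
# Hedged sketch of encoder-only fine-tuning for topic classification.
# Checkpoint, data files, and num_labels are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "xlm-roberta-base"  # placeholder encoder checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)

# Placeholder CSV files with "text" and "label" columns.
ds = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
ds = ds.map(lambda x: tok(x["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=12)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="topic-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,
)
trainer.train()
```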