Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian

19 Jun 2024 | Aleksandr Nikolich, Konstantin Korolev, Igor Kiselev, Artem Shelmanov
Vikhr is a new open-source instruction-tuned large language model (LLM) designed specifically for the Russian language. It addresses the challenges of text generation in non-English languages, including poor generation quality and computational inefficiency caused by imbalanced token representation. Unlike previous efforts that attached LoRA adapters to English-oriented models, Vikhr uses an adapted tokenizer vocabulary and undergoes full-weight continued pre-training and instruction tuning. This approach improves performance, computational efficiency, and contextual understanding. Vikhr outperforms some proprietary models on Russian benchmarks and is publicly available.

The model is built from an English-oriented LLM with a custom tokenizer trained on Russian data, which improves tokenization efficiency and reduces computational overhead. Continued pre-training on Russian corpora, combined with instruction tuning on expanded datasets, enables Vikhr to achieve state-of-the-art results while maintaining strong performance in English. The construction pipeline consists of three stages: vocabulary adaptation, continued pre-training, and instruction tuning. For vocabulary adaptation, a SentencePiece tokenizer is trained on the RuLM dataset, tokenizing Russian text more efficiently than the original English-oriented tokenizer.
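A minimal sketch of this vocabulary-adaptation step, assuming the standard sentencepiece Python API; the corpus path, vocabulary size, and base model name (mistralai/Mistral-7B-v0.1) are illustrative placeholders rather than the exact Vikhr configuration:

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Train a SentencePiece (BPE) tokenizer on a Russian corpus.
# "rulm_corpus.txt", the vocabulary size, and other settings are
# placeholders, not the actual Vikhr training configuration.
spm.SentencePieceTrainer.train(
    input="rulm_corpus.txt",
    model_prefix="vikhr_ru",
    vocab_size=40000,
    model_type="bpe",
    character_coverage=0.9995,
)

ru_tok = spm.SentencePieceProcessor(model_file="vikhr_ru.model")
# An example English-oriented base tokenizer, used only for comparison.
en_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "Открытые инструктивные языковые модели для русского языка"
print("adapted tokenizer: ", len(ru_tok.encode(text)), "tokens")
print("original tokenizer:", len(en_tok.encode(text, add_special_tokens=False)), "tokens")
```

Fewer tokens per Russian sentence means shorter input sequences, which is what yields the lower computational overhead and larger effective context window described above.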
Continued pre-training is performed with regularization to prevent catastrophic forgetting, and instruction tuning uses a novel set of Russian instruction-output pairs, including translations of English instruction datasets. Training was carried out on eight NVIDIA A100 GPUs, with roughly 1,000 GPU hours for continued pre-training and 60 GPU hours for instruction tuning.

Vikhr outperforms several open-source and proprietary models on Russian benchmarks, including MMLU, Ru-MMLU, CheGeKa, Russian SuperGLUE, and MERA. While it slightly underperforms some proprietary models on certain benchmarks, it surpasses them on most tasks. The paper highlights the importance of instruction tuning for achieving strong zero-shot performance and natural interaction with LLMs. It also notes the limitations of current models, including gaps in culture-specific knowledge and the need for further research on alignment techniques. Vikhr represents a significant advancement in open-source Russian LLMs, offering efficient and effective instruction-following capabilities for multilingual natural language processing research.
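As an illustration of the continued pre-training step described above, the sketch below adds a simple L2 penalty that anchors the weights to the original base model, one common way to limit catastrophic forgetting. The summary does not specify Vikhr's actual regularizer, and the model name, learning rate, and penalty strength here are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder base model and hyperparameters; the actual Vikhr setup may differ.
BASE = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
# Frozen copy of the original weights, used as an anchor for the penalty.
ref_params = {name: p.detach().clone() for name, p in model.named_parameters()}
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
REG_STRENGTH = 0.01  # illustrative value

def continued_pretraining_step(batch):
    """One step of causal-LM training on Russian text with an L2 penalty
    toward the base weights (an illustrative anti-forgetting regularizer)."""
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["input_ids"])
    penalty = sum(((p - ref_params[name]) ** 2).sum()
                  for name, p in model.named_parameters() if p.requires_grad)
    loss = outputs.loss + REG_STRENGTH * penalty
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The subsequent instruction-tuning stage could reuse a similar loop on the Russian instruction-output pairs; the paper's exact recipe for that stage is not detailed in this summary.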