Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer

5 Apr 2024 | Hele-Andra Kuulmets, Taido Purason, Agnes Luhtaru, Mark Fishel
This paper explores cost-effective methods for adapting large language models (LLMs) to low-resource languages, focusing on Estonian. Starting from Llama 2, the study investigates combining cross-lingual instruction-tuning with additional monolingual pretraining, and finds that even a small amount of monolingual pretraining followed by cross-lingual instruction-tuning significantly improves performance on Estonian. It further demonstrates cross-lingual knowledge transfer from high-quality English instructions to Estonian, improving commonsense reasoning and multi-turn conversation capabilities.

The best model, LLAMMAS, is the first open-source instruction-following LLM for Estonian. Alongside it, the authors publish Alpaca-est, the first general-task instruction dataset for Estonian. The models are evaluated on translation, grammatical error correction, and question answering, performing competitively on Estonian tasks and showing some improvements on English tasks as well. Ablations over training strategies and datasets indicate that mixing high-quality English instructions into cross-lingual instruction-tuning is an effective way to boost Estonian performance, underscoring the value of cross-lingual knowledge transfer and of high-quality instruction data.

The authors note two main limitations: a dependence on data generated with proprietary LLMs and the limited number of benchmarks available for Estonian. They conclude that the work is a significant first step towards open-source LLMs for Estonian.
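To make the cross-lingual instruction-tuning setup concrete, here is a minimal sketch of how a mixed English and Estonian instruction set could be rendered into plain training text. It assumes Alpaca-style records with "instruction", "input", and "output" fields; the file names, the Alpaca prompt template, and the unweighted 1:1 mix are illustrative assumptions, not the paper's exact data recipe.

```python
import json
import random

# Standard Alpaca-style prompt templates (assumed here for illustration).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def render(example):
    # Pick the template based on whether the record carries an input field.
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return {"text": template.format(**example)}

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# High-quality English instructions plus the Estonian Alpaca-est set
# (hypothetical local paths).
english = load_jsonl("english_instructions.jsonl")
estonian = load_jsonl("alpaca_est.jsonl")

# Simple unweighted mix of the two languages, shuffled together.
mixed = [render(ex) for ex in english + estonian]
random.shuffle(mixed)

with open("instructions_mixed.jsonl", "w", encoding="utf-8") as f:
    for row in mixed:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```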
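The two-stage recipe itself, a short period of continued pretraining on raw Estonian text followed by instruction-tuning on the mixed instruction file produced above, can then be sketched with Hugging Face transformers and datasets. This is a minimal illustration under assumed file names and hyperparameters, not the paper's exact training configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# Causal-LM collator: copies input_ids to labels, no masked-LM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def train_on(jsonl_path, output_dir):
    # Each line of the (hypothetical) JSONL file holds {"text": "..."}.
    data = load_dataset("json", data_files=jsonl_path)["train"]
    data = data.map(tokenize, batched=True, remove_columns=data.column_names)
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    )
    Trainer(model=model, args=args, train_dataset=data,
            data_collator=collator).train()

# Stage 1: continued pretraining on raw Estonian text. The Trainer updates
# `model` in place, so stage 2 starts from the stage-1 weights.
train_on("estonian_corpus.jsonl", "llama2-et-pretrained")
# Stage 2: instruction-tuning on the mixed English + Estonian instructions
# rendered to plain text by the formatting sketch above.
train_on("instructions_mixed.jsonl", "llammas-style-sft")
```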