LLaMA Beyond English: An Empirical Study on Language Capability Transfer


12 Jan 2024 | Jun Zhao*, Zhihao Zhang*, Luhui Gao, Qi Zhang*, Tao Gui, Xuanjing Huang
This paper investigates how to effectively transfer the language generation and instruction-following capabilities of LLaMA to non-English languages. Using LLaMA as the base model, the study conducts an extensive empirical investigation, accumulating over 1440 GPU hours, and analyzes the impact of the key transfer factors: vocabulary extension, further pretraining, and instruction tuning. Four standardized testing benchmarks (C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench) assess the model's knowledge level, while LLM-Eval evaluates response quality across 17 diverse instruction categories.

The results show that performance comparable to state-of-the-art transfer models, in both knowledge alignment and response quality, can be achieved with less than 1% of their pretraining data. The study also finds that vocabulary extension is not a suitable choice for small-scale incremental pretraining on the order of tens of billions of tokens. An analysis of the training scales required for effective transfer shows that further Chinese pretraining with 100 billion tokens or fewer is insufficient to significantly improve LLaMA's knowledge level, whereas improving response quality requires only hundreds of thousands of instruction examples.

The study also examines the effect of transfer training on LLaMA's original English capabilities: exclusive reliance on Chinese corpora for transfer training markedly compromises LLaMA's English proficiency, a concern that multilingual joint training effectively alleviates.

Taken together, these findings enable the transfer of LLaMA's language generation and instruction-following capabilities to non-English languages at minimal cost. On the four standardized benchmarks and LLM-Eval, the resulting model matches the knowledge level and response quality of the state-of-the-art Open Chinese LLaMA while using less than 1% of its training data. Extension experiments on 13 additional low-resource languages exhibit similar trends. Finally, the study investigates the code-switching observed during transfer training, suggesting that cross-lingual alignment may have been internalized within the model. These findings offer guidance to the community for developing non-English LLMs.
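For readers less familiar with the vocabulary-extension step analyzed above, the sketch below shows how it is commonly implemented with the Hugging Face transformers library. The checkpoint path and the token list are illustrative placeholders, not details from the paper.

# Minimal sketch of vocabulary extension for LLaMA, assuming the
# Hugging Face `transformers` library. Checkpoint path and tokens
# are hypothetical placeholders.
from transformers import LlamaForCausalLM, LlamaTokenizer

checkpoint = "path/to/llama-7b"  # hypothetical local checkpoint
tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
model = LlamaForCausalLM.from_pretrained(checkpoint)

# Add target-language subwords. Real pipelines mine thousands of
# frequent tokens from a target-language corpus; two are shown here.
new_tokens = ["你好", "世界"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")

# Grow the embedding matrix so the new token IDs have rows. These
# rows are randomly initialized and must be learned during further
# pretraining, which is one plausible reason extension underperforms
# at small incremental-pretraining budgets.
model.resize_token_embeddings(len(tokenizer))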
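Similarly, the multilingual joint training that preserves English ability can be realized at the data level by mixing the two languages within each training batch. The following self-contained sketch illustrates one such scheme; the 80/20 ratio and the toy documents are assumptions for illustration, not settings reported in the study.

# Minimal sketch of multilingual joint training at the data level:
# each batch mixes the target language with English at a fixed ratio.
import random

def mixed_batches(zh_data, en_data, zh_ratio=0.8, batch_size=4, seed=0):
    """Yield batches sampled from two corpora at a fixed language ratio.

    Keeping some English in every batch continually rehearses the
    model's original English ability during transfer training.
    """
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(zh_data) if rng.random() < zh_ratio else rng.choice(en_data)
            for _ in range(batch_size)
        ]

# Toy usage with placeholder "documents".
zh = [f"zh_doc_{i}" for i in range(10)]
en = [f"en_doc_{i}" for i in range(10)]
print(next(mixed_batches(zh, en)))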