LLaMA Beyond English: An Empirical Study on Language Capability Transfer

12 Jan 2024 | Jun Zhao*, Zhihao Zhang*, Luhui Gao, Qi Zhang†, Tao Gui, Xuanjing Huang
This paper investigates the transfer of language generation and instruction-following capabilities from English to non-English languages using the LLaMA model. The authors conduct an extensive empirical study, analyzing the impact of vocabulary extension, further pretraining, and instruction tuning on the transfer process. They evaluate the model's knowledge level on four standardized testing benchmarks (C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench) and its response quality on the LLM-Eval benchmark. Key findings include:

1. **Vocabulary Extension**: Extending the vocabulary with additional Chinese tokens does not improve performance at small incremental-pretraining scales; the model that keeps the original vocabulary performs at least as well, even though the extended-vocabulary model was further pretrained on a larger scale.
2. **Training Scales**: Further pretraining with 100 billion tokens or fewer is insufficient to significantly enhance the model's knowledge level. In contrast, improving response quality (language generation and instruction following) requires only a few hundred thousand instruction-tuning examples.
3. **Original English Capabilities**: Exclusive reliance on Chinese corpora for transfer training compromises the model's original English proficiency; this degradation can be alleviated through multilingual joint training.
4. **Low-Resource Languages**: Similar trends are observed across 13 low-resource languages, indicating that the findings are not specific to Chinese.

The results suggest that performance comparable to state-of-the-art transfer models can be achieved with less than 1% of their pretraining data, making it far more efficient and cost-effective to develop non-English LLMs. The paper provides valuable insights and guidance for researchers working on multilingual language models.
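To make the vocabulary-extension step concrete, below is a minimal sketch of how extra target-language tokens are typically added to a LLaMA-style tokenizer before further pretraining, using the Hugging Face `transformers` API. The checkpoint name and the token list are placeholders for illustration, not the artifacts used in the paper.

```python
# Minimal sketch: extend a LLaMA tokenizer with extra Chinese tokens and
# resize the model's embedding matrix before further pretraining.
# Assumes the Hugging Face `transformers` library; checkpoint and tokens
# below are hypothetical placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical extra tokens; in practice these would come from a
# SentencePiece model trained on the target-language corpus.
new_tokens = ["你好", "世界", "语言", "模型"]
num_added = tokenizer.add_tokens(new_tokens)

# Newly added embedding rows are randomly initialized and must be learned
# during further pretraining on target-language text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

The paper's finding is that, at small incremental-pretraining scales, skipping this extension and reusing the original vocabulary can be the better choice, since the new embeddings require substantial additional data to train well.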