Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

27 Apr 2024 | Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki
This study explores the effectiveness of cross-lingual continual pre-training for enhancing the Japanese language capabilities of large language models (LLMs). The researchers constructed Swallow, an LLM with improved Japanese capabilities, by extending the vocabulary of LLaMA 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results show that Swallow significantly improves performance on Japanese tasks, with improvements increasing monotonically with the amount of training data up to 100B tokens. Swallow also outperforms other LLMs trained from scratch on English and Japanese data. The study further investigates the impact of vocabulary expansion and the use of parallel corpora: vocabulary expansion improves computational efficiency without degrading performance, except on automatic summarization, and using parallel corpora enhances translation ability without affecting other tasks. The findings provide insights into effective methodologies for cross-lingual continual pre-training, particularly for Japanese language models.
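To make the vocabulary-expansion step concrete, the sketch below shows one plausible way to add Japanese tokens to a LLaMA 2 tokenizer and resize the model's embeddings before continual pre-training. This is a minimal illustration using the Hugging Face transformers API, not the authors' actual pipeline; the checkpoint name and the token list are assumptions for demonstration only.

```python
# Minimal sketch (assumed setup, not the Swallow training code) of the two key
# steps the abstract describes: expanding an LLM's tokenizer with Japanese
# tokens and resizing the embedding matrices before continual pre-training.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Hypothetical list of Japanese subword tokens, e.g. obtained by training a
# separate tokenizer on a Japanese corpus and keeping tokens absent from LLaMA 2.
new_japanese_tokens = ["日本語", "東京", "研究", "学習"]
num_added = tokenizer.add_tokens(new_japanese_tokens)

# Grow the input/output embeddings to cover the added vocabulary entries; the
# new rows are randomly initialized and learned during continual pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")

# Continual pre-training would then proceed with the standard causal
# language-modeling objective on the Japanese web corpus, e.g. via
# transformers.Trainer or a custom training loop.
```

The motivation for this step, as the abstract notes, is efficiency: with Japanese-specific tokens in the vocabulary, Japanese text is segmented into fewer tokens, reducing the compute needed per document during continual pre-training and inference.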