This paper presents Swallow, a Japanese-enhanced large language model (LLM) built by continual pre-training on a large Japanese web corpus. The model was created by extending the vocabulary of Llama 2 to include Japanese characters and then continuing pre-training on Japanese data. Experimental results show that Swallow significantly outperforms other LLMs trained from scratch on Japanese and English data, particularly on Japanese question answering tasks. Swallow's performance improves monotonically with the amount of continual pre-training data, up to 100B tokens.
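As a rough illustration of this setup (a minimal sketch, not the authors' actual training code), the snippet below shows how a Llama 2 tokenizer and model could be extended with Japanese tokens before continual pre-training using Hugging Face transformers; the checkpoint name and the added tokens are placeholder assumptions.

```python
# Minimal sketch (assumes Hugging Face transformers; not the paper's code).
# Extend the Llama 2 tokenizer with Japanese tokens, then resize the embedding
# matrix so the model can be continually pre-trained on Japanese text.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # base checkpoint to continue training from
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical list of Japanese tokens to add; in practice a Japanese
# tokenizer's vocabulary would be merged in, not a hand-written list.
japanese_tokens = ["日本", "東京", "研究", "学習"]
num_added = tokenizer.add_tokens(japanese_tokens)

# Append new rows to the input/output embeddings for the added tokens.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")

# Continual pre-training on the Japanese web corpus would then proceed with
# the usual causal language modeling objective.
```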
The study also investigates the effectiveness of vocabulary expansion and of using parallel corpora during cross-lingual continual pre-training. Vocabulary expansion improves the efficiency of Japanese text generation without degrading task performance, with summarization as the only exception. Using parallel corpora enhances translation ability without affecting performance on other tasks.
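One simple way to see the generation-efficiency gain from vocabulary expansion (a sketch under assumed model names, not the paper's exact evaluation) is to compare how many tokens each tokenizer needs for the same Japanese text:

```python
# Sketch: compare tokenization efficiency before/after vocabulary expansion.
# Both model identifiers are assumptions; any pair of tokenizers works here.
from transformers import AutoTokenizer

text = "吾輩は猫である。名前はまだ無い。"  # sample Japanese sentence

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
expanded_tok = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-7b-hf")

for name, tok in [("Llama 2", base_tok), ("expanded", expanded_tok)]:
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    # Fewer tokens per character means fewer decoding steps per generated text.
    print(f"{name}: {n_tokens} tokens ({n_tokens / len(text):.2f} tokens/char)")
```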
The paper highlights the benefits of continual pre-training for improving Japanese language capabilities, especially on knowledge-intensive tasks such as question answering. It also demonstrates that continual pre-training is more efficient than training from scratch, requiring fewer computational resources while achieving better performance. The study provides insights into the effectiveness of vocabulary expansion and parallel corpora for adapting LLMs to non-English languages. As of December 2023, Swallow achieved the highest Japanese performance among all models developed in Japan.