This study introduces CT-LLM, a 2-billion-parameter large language model (LLM) that prioritizes the Chinese language in its development. Unlike traditional LLMs, which are trained primarily on English datasets, CT-LLM is pre-trained on a comprehensive corpus of 1,200 billion tokens, comprising 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This composition enhances the model's proficiency in understanding and processing Chinese, a capability further refined through supervised fine-tuning (SFT).
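To make the reported mixture concrete, the following minimal Python sketch (illustrative only; the variable names are hypothetical and not taken from the paper's released code) derives the sampling proportions implied by the stated token counts.

```python
# Hypothetical sketch: pretraining data mixture implied by the reported token counts.
corpus_tokens = {
    "chinese": 800e9,  # Chinese tokens (e.g., MAP-CC)
    "english": 300e9,  # English tokens
    "code": 100e9,     # code tokens
}

total_tokens = sum(corpus_tokens.values())           # 1,200 billion tokens in total
mixture = {name: count / total_tokens for name, count in corpus_tokens.items()}

print(mixture)  # approximately {'chinese': 0.667, 'english': 0.25, 'code': 0.083}
```

Under these proportions, roughly two thirds of the pretraining data is Chinese, inverting the English-dominant mixtures typical of prior LLMs.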
The research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages. By open-sourcing the entire training process, including the Massive Appropriate Pretraining Chinese Corpus (MAP-CC) and the Chinese Hard Case Benchmark (CHC-Bench), the authors aim to foster further exploration and innovation in both academia and industry.
Key contributions of this work include:
1. **MAP-CC**: An open-source Chinese pretraining dataset with 800 billion tokens, offering high-quality Chinese pretraining data and effective data preparation methods.
2. **CHC-Bench**: A multidisciplinary benchmark of hard Chinese cases for evaluating instruction understanding and following.
3. **CT-LLM**: The first Chinese-centric LLM, pre-trained and fine-tuned primarily on Chinese corpora, offering significant insights into Chinese language ability and multilingual adaptability.
The study demonstrates CT-LLM's exceptional performance on the CHC-Bench, highlighting its adeptness in Chinese language tasks and versatility in English through SFT. The open-sourcing of the training process and datasets aims to promote a more inclusive and diverse landscape for future LLM developments, encouraging the exploration of models that better reflect global linguistic diversity.