Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

10 Jul 2024 | Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhui Chen, Ge Zhang
This study introduces CT-LLM, a 2B-parameter large language model (LLM) that prioritizes the Chinese language throughout its development. CT-LLM is pre-trained on a corpus of 1,200 billion tokens, comprising 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This composition enables the model to excel at Chinese language tasks while also demonstrating proficiency in English after supervised fine-tuning (SFT). The work challenges the traditional approach of training LLMs primarily on English corpora and promotes a more inclusive and diverse approach to language model training.

Alongside the model, the study releases the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a high-quality Chinese pretraining corpus, and the Chinese Hard Case Benchmark (CHC-Bench), which evaluates the model on complex Chinese tasks. CT-LLM's key contributions include providing a high-quality Chinese corpus, addressing biases, and advancing Chinese-focused LLMs; open-sourcing the training process and datasets encourages further exploration and innovation in both academia and industry. The model's results on benchmarks such as MMLU and COPA demonstrate strong capabilities in language understanding, reasoning, and domain-specific knowledge. The study also evaluates the model's safety and alignment with human preferences, showing that it generates safe and helpful responses. Overall, CT-LLM represents a significant advancement in the development of Chinese-centric large language models.
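As a rough illustration of the pretraining data mixture described above, the minimal Python sketch below converts the reported token counts into sampling proportions. It is not taken from the paper's codebase; the dictionary and variable names are invented for illustration, and only the token counts come from the abstract.

```python
# Illustrative sketch (not the authors' code): express the reported
# CT-LLM pretraining corpus composition as sampling proportions.
CORPUS_TOKENS_B = {   # billions of tokens, as reported in the abstract
    "chinese": 800,
    "english": 300,
    "code": 100,
}

total = sum(CORPUS_TOKENS_B.values())  # 1,200B tokens in total
proportions = {source: count / total for source, count in CORPUS_TOKENS_B.items()}

for source, share in proportions.items():
    # chinese ~66.7%, english 25.0%, code ~8.3%
    print(f"{source:>7}: {share:.1%}")
```

These proportions make the Chinese-centric design concrete: roughly two thirds of the pretraining tokens are Chinese, inverting the English-dominated mixtures used by most existing LLMs.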