MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

3 Jun 2024 | Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
The paper introduces MiniCPM, a series of Small Language Models (SLMs) with 2.4B and 1.2B non-embedding parameters, designed to be resource-efficient and scalable. These models excel in their respective categories and demonstrate capabilities comparable to 7B-13B Large Language Models (LLMs). The authors focus on SLMs to explore their potential and scalability in both model and data dimensions. They employ extensive model wind tunnel experiments for stable and optimal scaling and introduce a Warmup-Stable-Decay (WSD) learning rate scheduler for continuous training and domain adaptation. The WSD scheduler enables efficient study of the data-model scaling law, showing a higher compute-optimal data-model ratio than Chinchilla Optimal. The MiniCPM family includes MiniCPM-DPO, MiniCPM-128K, and MiniCPM-MoE, which further extend the foundation of SLMs to diverse applications. The paper also discusses the scaling law, training dynamics, and the benefits of introducing high-quality data during the decay stage of pre-training. Overall, MiniCPM represents a significant advancement in the development of small language models, advocating for a more sustainable approach to scaling up LLMs.
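To make the WSD scheduler concrete, below is a minimal sketch of what a Warmup-Stable-Decay learning rate schedule looks like: a linear warmup, a long constant (stable) phase, and a short final decay during which the paper mixes in high-quality data. The function name, hyperparameters (peak LR, warmup length, decay fraction), and the exponential decay form are illustrative assumptions, not the exact settings used for MiniCPM.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning rate schedule.
# Hyperparameters and the decay function are illustrative assumptions,
# not the values reported for MiniCPM.

def wsd_lr(step: int,
           total_steps: int,
           peak_lr: float = 1e-2,
           warmup_steps: int = 1000,
           decay_ratio: float = 0.1) -> float:
    """Return the learning rate at `step` under a WSD schedule.

    Phases:
      1. Warmup: linear ramp from 0 to peak_lr over `warmup_steps`.
      2. Stable: constant peak_lr until the decay phase begins.
      3. Decay:  exponential anneal over the final `decay_ratio` of training,
                 the stage where high-quality data is introduced.
    """
    decay_steps = max(1, int(total_steps * decay_ratio))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                      # 1. warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                       # 2. stable
        return peak_lr
    # 3. decay: repeated halving so the LR ends roughly two orders
    #    of magnitude below its peak (0.5**7 ~= 1/128).
    progress = (step - decay_start) / decay_steps
    return peak_lr * (0.5 ** (progress * 7))


# Example: inspect the schedule at a few checkpoints of a 100k-step run.
if __name__ == "__main__":
    for s in (0, 500, 1000, 50_000, 90_000, 95_000, 100_000):
        print(f"step {s:>7}: lr = {wsd_lr(s, 100_000):.2e}")
```

Because the stable phase uses a constant learning rate, training can be paused, resumed, or branched into a decay run from any stable-phase checkpoint, which is what makes the scheduler convenient for continuous training and for measuring the data-model scaling law without retraining from scratch.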