MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies


3 Jun 2024 | Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihao Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
MiniCPM is a series of small language models (SLMs) with 1.2B and 2.4B non-embedding parameters that perform strongly within their size categories and are comparable to 7B-13B large language models (LLMs). The paper shows that MiniCPM is scalable in both the model and data dimensions, enabling efficient study of data-model scaling laws without extensive retraining. It introduces the Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), which supports continuous training and domain adaptation and allows scaling-law experiments with linear effort along the model axis and negligible effort along the data axis. The paper also presents the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, which show excellent performance across a variety of tasks. It further analyzes MiniCPM's training dynamics, the resulting scaling law, and the family's results on benchmark datasets, finding that MiniCPM outperforms other SLMs on several tasks and that the WSD LRS enables efficient training and scaling. The authors conclude that MiniCPM marks a new stage in the development of small language models, demonstrating the potential of SLMs and advocating a more scientific and sustainable approach to scaling LLMs.
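The WSD LRS summarized above has three phases: a warmup to the peak learning rate, a long stable phase held at that peak, and a short decay at the end. The sketch below illustrates such a schedule in Python; the function name `wsd_lr`, the linear warmup, and the exponential decay shape are illustrative assumptions, not the paper's exact formulation.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr_ratio=0.1):
    """WSD-style (Warmup-Stable-Decay) learning-rate schedule.

    Three phases: linear warmup to max_lr, a long constant ("stable") phase,
    then a short decay toward min_lr_ratio * max_lr. The decay shape here is
    an exponential interpolation chosen for illustration.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        # Stable phase: constant learning rate.
        return max_lr
    # Decay phase: interpolate from max_lr down to min_lr_ratio * max_lr.
    t = min(1.0, (step - warmup_steps - stable_steps) / max(1, decay_steps))
    return max_lr * (min_lr_ratio ** t)

# Example: 1k warmup steps, 89k stable steps, 10k decay steps.
schedule = [wsd_lr(s, 1e-3, 1_000, 89_000, 10_000) for s in range(100_000)]
```

Because the stable phase holds the learning rate constant, any intermediate checkpoint can be branched into a short decay run, which is what lets the paper probe the data axis of the scaling law with negligible extra training effort.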