MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

3 Jun 2024 | Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
The paper introduces MiniCPM, a series of Small Language Models (SLMs) with 2.4B and 1.2B non-embedding parameters, designed to be resource-efficient and scalable. These models excel in their respective categories and demonstrate capabilities comparable to 7B-13B Large Language Models (LLMs). The authors focus on SLMs to explore their potential and scalability in both model and data dimensions. They employ extensive model wind tunnel experiments for stable and optimal scaling and introduce a Warmup-Stable-Decay (WSD) learning rate scheduler for continuous training and domain adaptation. The WSD scheduler enables efficient study of the data-model scaling law, showing a higher compute-optimal data-model ratio than Chinchilla Optimal. The MiniCPM family includes MiniCPM-DPO, MiniCPM-128K, and MiniCPM-MoE, which further extend the foundation of SLMs to diverse applications. The paper also discusses the scaling law, training dynamics, and the benefits of introducing high-quality data during the decay stage of pre-training. Overall, MiniCPM represents a significant advancement in the development of small language models, advocating for a more sustainable approach to scaling up LLMs.
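To make the WSD scheduler concrete, below is a minimal sketch of what a Warmup-Stable-Decay learning rate schedule looks like: a linear warmup, a long constant (stable) phase, and a short final decay during which the paper mixes in high-quality data. The function name, hyperparameters (peak LR, warmup length, decay fraction), and the exponential decay form are illustrative assumptions, not the exact settings used for MiniCPM.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning rate schedule.
# Hyperparameters and the decay function are illustrative assumptions,
# not the values reported for MiniCPM.

def wsd_lr(step: int,
           total_steps: int,
           peak_lr: float = 1e-2,
           warmup_steps: int = 1000,
           decay_ratio: float = 0.1) -> float:
    """Return the learning rate at `step` under a WSD schedule.

    Phases:
      1. Warmup: linear ramp from 0 to peak_lr over `warmup_steps`.
      2. Stable: constant peak_lr until the decay phase begins.
      3. Decay:  exponential anneal over the final `decay_ratio` of training,
                 the stage where high-quality data is introduced.
    """
    decay_steps = max(1, int(total_steps * decay_ratio))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                      # 1. warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                       # 2. stable
        return peak_lr
    # 3. decay: repeated halving so the LR ends roughly two orders
    #    of magnitude below its peak (0.5**7 ~= 1/128).
    progress = (step - decay_start) / decay_steps
    return peak_lr * (0.5 ** (progress * 7))


# Example: inspect the schedule at a few checkpoints of a 100k-step run.
if __name__ == "__main__":
    for s in (0, 500, 1000, 50_000, 90_000, 95_000, 100_000):
        print(f"step {s:>7}: lr = {wsd_lr(s, 100_000):.2e}")
```

Because the stable phase uses a constant learning rate, training can be paused, resumed, or branched into a decay run from any stable-phase checkpoint, which is what makes the scheduler convenient for continuous training and for measuring the data-model scaling law without retraining from scratch.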