Rethinking Optimization and Architecture for Tiny Language Models

6 Feb 2024 | Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang
This paper addresses the challenge of deploying large language models (LLMs) on mobile devices, where computational and memory costs are high. The authors focus on optimizing tiny language models with 1B parameters, studying neural architecture, parameter initialization, and optimization strategies. Key findings include:

1. **Neural Architecture**: A compact tokenizer is essential for reducing parameter redundancy; removing low-frequency vocabulary entries significantly improves performance without compromising representation capacity (a toy vocabulary-pruning sketch appears below). Architecture choices such as depth and width also have a strong impact, with deeper models generally achieving better results at the cost of inference speed.
2. **Parameter Initialization**: Inheriting parameters from larger models is effective, especially from layers near the beginning and end, which carry more significance (see the layer-inheritance sketch below). Data-driven learnable criteria outperform heuristic methods for selecting the crucial parameters.
3. **Model Optimization**: Multiple-round training helps mitigate data forgetting, and a simple data-refining strategy improves learning on challenging examples while reducing training cost (see the multi-round training sketch below). Tuning the batch size and learning rate is crucial, with a batch size smaller than 4M tokens recommended for optimal performance.

Building on these findings, the authors develop the PanGu-π-1B Pro and PanGu-π-1.5B Pro models, which achieve significant improvements over existing models. PanGu-π-1.5B Pro, with 16.67% fewer parameters than Qwen-1.8B, outperforms Qwen-1.8B and Phi2-2.7B across a variety of benchmarks. The paper also discusses future directions, including hardware-friendly architectures and new parameter optimization techniques for tiny models.
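To make the compact-tokenizer idea concrete, here is a minimal, hypothetical sketch of pruning low-frequency vocabulary entries based on their observed frequency in a sample corpus. It is not the paper's implementation; the function names, the `<unk>` convention (id 0), and the target vocabulary size are illustrative assumptions.

```python
# Hypothetical sketch: keep only the most frequent token ids seen in a
# representative corpus sample and drop the long tail of rarely used entries.
import random
from collections import Counter

def prune_vocabulary(token_ids, vocab_size_target):
    """Return a remapping that keeps only the most frequent token ids.

    token_ids: iterable of token ids produced by the original tokenizer.
    vocab_size_target: size of the compact vocabulary to keep.
    """
    freq = Counter(token_ids)
    # Keep the most common ids; everything else maps to <unk> (id 0 here).
    kept = {tok for tok, _ in freq.most_common(vocab_size_target)}
    old_to_new = {tok: new_id for new_id, tok in enumerate(sorted(kept), start=1)}

    def remap(ids):
        return [old_to_new.get(t, 0) for t in ids]

    return old_to_new, remap

# Toy example: shrink a synthetic "vocabulary" of ids 0..9999 to 4000 entries.
sample = [random.randint(0, 9999) for _ in range(100_000)]
mapping, remap = prune_vocabulary(sample, vocab_size_target=4000)
print(len(mapping), remap(sample[:5]))
```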
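The layer-inheritance sketch below illustrates, under assumptions, how a smaller model might reuse parameters from layers near the beginning and end of a larger model, which the summary notes carry more significance. It assumes PyTorch is available; the layer counts, `nn.Linear` stand-in layers, and helper names are made up for illustration and are not the paper's data-driven selection criterion.

```python
# Minimal sketch (assuming PyTorch): build a smaller layer stack by copying
# the layers nearest to the input and output of a larger model and skipping
# the middle. A learnable, data-driven criterion (as in the paper) would
# replace this fixed heuristic in practice.
import copy
import torch.nn as nn

def select_layer_indices(num_large_layers, num_small_layers):
    """Pick layer indices concentrated at the beginning and end."""
    half = num_small_layers // 2
    head = list(range(half))
    tail = list(range(num_large_layers - (num_small_layers - half), num_large_layers))
    return head + tail

def inherit_layers(large_layers, num_small_layers):
    """Deep-copy the selected layers of the large model into a smaller stack."""
    idx = select_layer_indices(len(large_layers), num_small_layers)
    return nn.ModuleList([copy.deepcopy(large_layers[i]) for i in idx])

# Toy example: a 24-layer "large" stack reduced to 12 inherited layers.
large = nn.ModuleList([nn.Linear(64, 64) for _ in range(24)])
small = inherit_layers(large, num_small_layers=12)
print(select_layer_indices(24, 12))  # [0, 1, 2, 3, 4, 5, 18, 19, 20, 21, 22, 23]
```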
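Finally, a hedged sketch of multi-round training with a simple data-refining step: after each round, only the harder samples (here, those with the highest loss) are kept for the next round, so later rounds revisit challenging data at lower cost. The loss function, keep ratio, and round count are placeholder assumptions, not the paper's exact recipe.

```python
# Hypothetical multi-round training loop with loss-based data refining.
# `compute_loss` is a stand-in for a real per-sample forward pass.
import random

def compute_loss(sample):
    # Placeholder: in practice this would be the model's loss on the sample.
    return random.random()

def refine_dataset(dataset, keep_ratio=0.5):
    """Keep the hardest `keep_ratio` fraction of samples by loss."""
    scored = sorted(dataset, key=compute_loss, reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]

def multi_round_training(dataset, num_rounds=3, keep_ratio=0.5):
    data = list(dataset)
    for round_idx in range(num_rounds):
        # train_one_round(model, data)  # actual optimization step omitted
        print(f"round {round_idx}: training on {len(data)} samples")
        data = refine_dataset(data, keep_ratio)

multi_round_training(range(10_000))
```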