Rethinking Optimization and Architecture for Tiny Language Models


6 Feb 2024 | Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang
This paper presents a comprehensive study on optimizing and rethinking the architecture of tiny language models (TLMs). The authors focus on three key aspects: neural architecture, parameter initialization, and optimization strategy. They propose several design formulas that are empirically shown to be effective for TLMs, including tokenizer compression, architecture tweaking, parameter inheritance, and multiple-round training. The study is carried out on a 1B-parameter TLM, and the resulting recipe is used to train two models, PanGu-π-1B Pro and PanGu-π-1.5B Pro, on a 1.6T-token multilingual corpus. PanGu-π-1B Pro achieves an average improvement of 8.87 on benchmark evaluation sets, while PanGu-π-1.5B Pro outperforms several state-of-the-art models with larger parameter counts. The code is available at https://github.com/YuchuanTian/RethinkTinyLM.

On the architecture side, the study investigates the impact of various design choices on TLM performance. It finds that a compact tokenizer, which removes low-frequency vocabulary entries, significantly reduces parameter usage without compromising performance. The authors also explore the effects of depth, width, and the expansion rate of the feed-forward networks (FFNs), finding that deeper models generally achieve better performance, but at the cost of slower inference.
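To make the compact-tokenizer idea above concrete, the following is a minimal sketch, assuming a generic `encode` callable and a `keep_k` token budget (both illustrative, not the paper's exact procedure): low-frequency tokens are dropped from the vocabulary and the embedding table is shrunk to match.

```python
# Minimal sketch of tokenizer compression: keep only the most frequent tokens
# observed on a corpus sample and shrink the embedding table accordingly.
# `encode`, `keep_k`, and the frequency-counting step are illustrative
# assumptions, not the paper's exact compression recipe.
from collections import Counter
import torch

def compress_vocab(encode, corpus_sample, embedding, keep_k, special_ids=()):
    """encode: callable mapping a string to a list of token ids.
    embedding: (vocab_size, hidden) tensor from the original model.
    Returns an old->new id map and the compressed embedding table."""
    counts = Counter()
    for text in corpus_sample:
        counts.update(encode(text))
    # Always keep special tokens, then the most frequent regular tokens.
    kept = list(special_ids)
    for tok_id, _ in counts.most_common():
        if len(kept) >= keep_k:
            break
        if tok_id not in special_ids:
            kept.append(tok_id)
    id_map = {old: new for new, old in enumerate(kept)}
    new_embedding = embedding[torch.tensor(kept)].clone()
    return id_map, new_embedding
```

In a full pipeline, the same id map would also be applied to the training data and to the output (LM head) projection so that input and output vocabularies stay consistent.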
The study also examines the importance of parameter initialization, finding that inheriting parameters from a larger model improves performance and accelerates convergence. The authors propose a data-driven approach to parameter selection, which proves more effective than heuristic selection rules; a minimal illustrative sketch of this idea is given at the end of this summary.

On the optimization side, the authors investigate the relationship between batch size and learning rate, finding that a moderate batch-size increment rate improves performance without significantly slowing convergence. They also explore multiple-round training, which helps reduce data forgetting and further improves model quality.

The study concludes that the proposed methods significantly improve the performance of TLMs, and that PanGu-π-1.5B Pro achieves state-of-the-art results on various benchmarks. The authors recommend using a compact tokenizer, adjusting the model's depth and width, and employing parameter inheritance and multiple-round training when building TLMs.
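As referenced in the parameter-initialization paragraph above, here is a minimal sketch of data-driven parameter selection for inheritance. The first-order |weight × gradient| importance score, the single-layer treatment, and the `loss_fn` interface are assumptions made for illustration; the paper's actual selection criterion may differ.

```python
# Minimal sketch of data-driven parameter inheritance: rank the output neurons
# of a larger model's linear layer by an importance score estimated on a small
# calibration batch, then copy only the top-ranked rows into the smaller layer.
# The |weight * grad| metric is a common heuristic and an assumption here.
import torch
import torch.nn as nn

def inherit_linear(large_layer: nn.Linear, small_out: int,
                   calib_inputs: torch.Tensor, loss_fn) -> nn.Linear:
    large_layer.zero_grad()
    loss_fn(large_layer(calib_inputs)).backward()  # populate weight gradients
    # First-order importance of each output neuron (row of the weight matrix).
    score = (large_layer.weight * large_layer.weight.grad).abs().sum(dim=1)
    keep = torch.topk(score, small_out).indices
    small_layer = nn.Linear(large_layer.in_features, small_out,
                            bias=large_layer.bias is not None)
    with torch.no_grad():
        small_layer.weight.copy_(large_layer.weight[keep])
        if large_layer.bias is not None:
            small_layer.bias.copy_(large_layer.bias[keep])
    return small_layer
```

In practice the selection has to be applied consistently across attention heads, FFN channels, and layers so that the kept dimensions remain aligned between consecutive modules of the smaller model.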
[slides and audio] PanGu-π Pro: Rethinking Optimization and Architecture for Tiny Language Models