A Survey on Transformer Compression


7 Apr 2024 | Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao, Fellow, IEEE
This survey provides a comprehensive review of recent compression methods for Transformer-based models, focusing on their application to both natural language processing (NLP) and computer vision (CV) domains. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design. In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods and discuss further directions in this domain.

Transformer models have become the prevailing choice across various domains, including NLP and CV, due to their strong scalability. However, their large sizes pose challenges for practical deployment, and model compression is an effective strategy for mitigating the costs associated with developing and serving Transformer models. This approach, grounded in the principle of reducing redundancy, encompasses several categories, including pruning, quantization, knowledge distillation, and efficient architecture design. Quantization reduces storage and computation costs by representing model weights and intermediate features with fewer bits. Knowledge distillation serves as a training strategy, transferring knowledge from a large model (the teacher) to a smaller model (the student). Efficient architecture design creates models with reduced computational complexity, with examples such as Mamba, RetNet, and RWKV. Combining different methods enables extreme compression.

Compression strategies for Transformer models exhibit distinct characteristics. Unlike other architectures such as CNNs or RNNs, the Transformer features a unique design of alternating attention and FFN modules. The efficiency of compression methods is especially important for such large models: because of their high computational cost, it is usually unaffordable to retrain the whole model on the original training set, so training-efficient approaches such as post-training compression are preferable.

This survey comprehensively investigates how to compress these Transformer models, categorizing the methods into quantization, knowledge distillation, pruning, efficient architecture design, and related techniques. In each category, we investigate the compression methods for the NLP and CV domains, respectively. We also discuss the relationship between different compression methods and outline some future research directions.
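To make the quantization category concrete, the sketch below shows symmetric per-tensor int8 weight quantization in PyTorch, one simple way of "representing model weights with fewer bits." It is an illustrative toy, not a method taken from the survey; the function names quantize_weight_int8 and dequantize and the 512-unit linear layer are our own choices.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization of a weight matrix.

    Returns the int8 tensor plus the scale needed to dequantize.
    """
    scale = w.abs().max() / 127.0                    # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight for inference or error analysis."""
    return q.float() * scale

# Toy usage: quantize the weight of a linear layer and measure the rounding error.
layer = torch.nn.Linear(512, 512)
q, scale = quantize_weight_int8(layer.weight.data)
w_hat = dequantize(q, scale)
print("max abs error:", (layer.weight.data - w_hat).abs().max().item())
```

Practical schemes in the survey go further (per-channel scales, activation quantization, post-training calibration), but they build on this same round-and-rescale idea.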
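Similarly, the teacher-student training strategy behind knowledge distillation can be summarized by a single loss term. The minimal sketch below combines soft teacher targets with hard labels, in the style of classic logit distillation; the name distillation_loss and the temperature T=2.0 and mixing weight alpha=0.5 are illustrative defaults, not values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Logit distillation: soft targets from the teacher plus hard ground-truth labels.

    T is the softmax temperature; alpha balances the soft and hard terms.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class task.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```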