A Survey on Transformer Compression


7 Apr 2024 | Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao, Fellow, IEEE
This survey provides a comprehensive review of recent compression methods for Transformer-based models, focusing on their application to both natural language processing (NLP) and computer vision (CV) domains. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design. In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods and discuss further directions in this domain.

Transformer models have become the prevailing choice across various domains, including NLP and CV, due to their strong scalability. However, their large sizes pose challenges for practical deployment, and model compression is an effective strategy for mitigating the costs associated with developing and serving Transformer models. This approach, grounded in the principle of reducing redundancy, encompasses several categories, including pruning, quantization, knowledge distillation, and efficient architecture design. Quantization reduces storage and computation costs by representing model weights and intermediate features with fewer bits. Knowledge distillation serves as a training strategy, transferring knowledge from a large model (the teacher) to a smaller model (the student). Efficient architecture design creates models with reduced computational complexity, with examples such as Mamba, RetNet, and RWKV. Combining different methods enables extreme compression.

Compression strategies for Transformer models exhibit distinct characteristics. Unlike other architectures such as CNNs or RNNs, the Transformer features a unique design of alternating attention and FFN modules. The efficiency of compression methods is especially important for such large models: because of their high computational cost, it is usually unaffordable to retrain the whole model on the original training set, so training-efficient approaches such as post-training compression are preferable.

This survey comprehensively investigates how to compress these Transformer models, categorizing the methods into quantization, knowledge distillation, pruning, efficient architecture design, and related techniques. In each category, we investigate the compression methods for the NLP and CV domains, respectively. We also discuss the relationship between different compression methods and outline some future research directions.
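To make the quantization category concrete, the sketch below shows symmetric per-tensor int8 weight quantization in PyTorch, one simple way of "representing model weights with fewer bits." It is an illustrative toy, not a method taken from the survey; the function names quantize_weight_int8 and dequantize and the 512-unit linear layer are our own choices.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization of a weight matrix.

    Returns the int8 tensor plus the scale needed to dequantize.
    """
    scale = w.abs().max() / 127.0                    # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight for inference or error analysis."""
    return q.float() * scale

# Toy usage: quantize the weight of a linear layer and measure the rounding error.
layer = torch.nn.Linear(512, 512)
q, scale = quantize_weight_int8(layer.weight.data)
w_hat = dequantize(q, scale)
print("max abs error:", (layer.weight.data - w_hat).abs().max().item())
```

Practical schemes in the survey go further (per-channel scales, activation quantization, post-training calibration), but they build on this same round-and-rescale idea.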
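Similarly, the teacher-student training strategy behind knowledge distillation can be summarized by a single loss term. The minimal sketch below combines soft teacher targets with hard labels, in the style of classic logit distillation; the name distillation_loss and the temperature T=2.0 and mixing weight alpha=0.5 are illustrative defaults, not values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Logit distillation: soft targets from the teacher plus hard ground-truth labels.

    T is the softmax temperature; alpha balances the soft and hard terms.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class task.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```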