14 Jun 2020 | Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
This paper reviews recent techniques for model compression and acceleration of deep neural networks (DNNs). DNNs have achieved great success in visual recognition tasks, but they are computationally expensive and memory-intensive, which makes them difficult to deploy on low-resource devices or in applications with strict latency requirements. To address this, a variety of methods have been developed to compress and accelerate DNNs without significantly reducing their performance. These techniques fall into four main categories: parameter pruning and quantization, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation, in which a smaller student model is trained to mimic the behavior of a larger teacher model. Each category is analyzed in terms of its performance, applications, advantages, and drawbacks, and recent successful methods such as dynamic capacity networks and stochastic depth networks are also discussed. The paper further surveys evaluation metrics, datasets, and benchmarking efforts, highlights the importance of model compression for efficient deployment on portable devices, and examines the trade-offs among compression rate, computational efficiency, and model accuracy. It also addresses open challenges such as hardware constraints, the need for better configuration strategies, and the influence of prior knowledge on model performance.
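The survey does not prescribe a single algorithm for the pruning-and-quantization category, but the general idea can be illustrated with a minimal sketch. The example below assumes nothing beyond NumPy and a made-up weight matrix: it prunes weights by magnitude and then applies uniform 8-bit quantization. It is an illustration of the category, not a reproduction of any specific method reviewed in the paper.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_uint8(weights: np.ndarray):
    """Uniform 8-bit quantization: store uint8 codes plus a scale and an offset."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Recover approximate float weights from the 8-bit codes."""
    return codes.astype(np.float32) * scale + w_min

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)  # toy weight matrix
    w_pruned = magnitude_prune(w, sparsity=0.9)          # keep only the largest 10%
    codes, scale, w_min = quantize_uint8(w_pruned)       # 4x smaller than float32 storage
    w_hat = dequantize(codes, scale, w_min)
    print("sparsity:", np.mean(w_pruned == 0.0))
    print("max quantization error:", np.max(np.abs(w_hat - w_pruned)))
```

In practice the pruned, quantized weights would be stored in a sparse or codebook format and fine-tuned to recover accuracy; the sketch only shows the compression step itself.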
The authors suggest that future work focus on improving compression techniques, exploring hardware-aware approaches, and extending model compression to a broader range of tasks and deep learning models.
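To make the knowledge-distillation idea mentioned above concrete, the sketch below computes a standard distillation loss: temperature-softened teacher and student outputs are compared with a KL-divergence term and combined with the usual cross-entropy on the true labels, following the formulation popularized by Hinton et al. The logits, temperature, and weighting here are illustrative assumptions, not values taken from the surveyed papers.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax along the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7) -> float:
    """Weighted sum of a soft-target KL term and the hard-label cross-entropy.

    The T^2 factor keeps the soft-target gradients on the same scale as the
    hard-label term, as in the standard distillation formulation.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    soft_loss = (temperature ** 2) * kl.mean()

    p_hard = softmax(student_logits)  # temperature = 1 for the true labels
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()

    return alpha * soft_loss + (1.0 - alpha) * ce

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 10))                   # toy logits from a large "teacher"
    student = teacher + 0.5 * rng.normal(size=(8, 10))   # smaller "student", imperfect copy
    labels = rng.integers(0, 10, size=8)
    print("distillation loss:", distillation_loss(student, teacher, labels))
```

Minimizing this loss trains the compact student to reproduce the teacher's softened output distribution while still fitting the ground-truth labels, which is the mechanism by which distillation transfers the larger model's behavior into a smaller one.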