A Survey of Quantization Methods for Efficient Neural Network Inference
21 Jun 2021 | Amir Gholami*, Sehoon Kim*, Zhen Dong*, Zhewei Yao*, Michael W. Mahoney, Kurt Keutzer
This paper surveys quantization methods for efficient neural network (NN) inference. Quantization is the process of mapping continuous real-valued numbers to a small discrete set, reducing memory footprint and computational cost while preserving accuracy. It is particularly relevant for resource-constrained environments, given the high memory and compute demands of modern NNs.

The survey situates quantization among complementary approaches to efficient inference, including efficient model architectures, hardware/software co-design, pruning, and knowledge distillation. Quantization has shown notable success in both training and inference, especially with low-precision formats such as INT8. The paper also draws a connection to neuroscience, noting evidence that the brain may store information in a discrete form.

The survey covers the history of quantization, basic concepts, advanced topics, and implications for hardware accelerators. It highlights the trade-off between accuracy and efficiency, and contrasts uniform with non-uniform quantization as well as static with dynamic quantization. It further covers fine-tuning methods, namely Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), and explores zero-shot quantization for scenarios where training data is unavailable. The survey concludes with a discussion of open challenges and opportunities in quantization research for NNs.
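To make the core idea concrete, below is a minimal sketch of uniform (affine) quantization, the baseline scheme the survey builds on: a real tensor is mapped to integers via a scale and zero-point derived from its observed range, as in static PTQ-style calibration. The function names, NumPy usage, and min/max calibration choice are illustrative assumptions, not code from the paper.

import numpy as np

def uniform_quantize(x, num_bits=8, symmetric=False):
    """Uniformly quantize a float array to signed integers.

    The scale S and zero-point Z are derived from the array's observed
    range (a simple static calibration over the given data); this is an
    illustrative sketch, not the paper's reference implementation.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    if symmetric:
        # Symmetric quantization: zero-point fixed at 0.
        scale = np.max(np.abs(x)) / qmax
        zero_point = 0
    else:
        # Asymmetric quantization: map [x_min, x_max] onto [qmin, qmax].
        x_min, x_max = float(x.min()), float(x.max())
        scale = (x_max - x_min) / (qmax - qmin)
        zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate real values."""
    return scale * (q.astype(np.float32) - zero_point)

# Example: quantize simulated activations and measure the round-trip error.
x = np.random.randn(1024).astype(np.float32)
q, s, z = uniform_quantize(x, num_bits=8)
x_hat = dequantize(q, s, z)
print("max abs error:", np.abs(x - x_hat).max())

In the static quantization discussed in the survey, the scale and zero-point would be fixed ahead of time from calibration data, whereas dynamic quantization recomputes them at runtime from each input's actual range.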