Quantizing deep convolutional networks for efficient inference: A whitepaper

June 2018 | Raghuraman Krishnamoorthi
This whitepaper discusses techniques for quantizing deep convolutional networks for efficient inference with integer weights and activations. It covers quantizer design, quantized inference performance and accuracy, training best practices, model architecture recommendations, run-time measurements, and recommendations for neural network accelerators.

The paper presents several quantization schemes, including the uniform affine quantizer, the uniform symmetric quantizer, and the stochastic quantizer, and discusses their impact on model accuracy and performance. It also examines post-training quantization and quantization-aware training, showing that quantization-aware training can significantly improve accuracy over post-training approaches. The paper highlights that per-channel quantization of weights and per-layer quantization of activations are preferred for hardware acceleration and kernel optimization, and it recommends that future processors and hardware accelerators support precisions of 4, 8, and 16 bits. Minimal sketches of these schemes are given below.

The paper concludes that quantizing deep convolutional networks can significantly reduce model size, improve inference speed, and lower power consumption, making it an essential technique for deploying deep learning models on edge devices.
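To make the two deterministic schemes concrete, here is a minimal NumPy sketch of the uniform affine and uniform symmetric quantizers for b-bit integers. The function names, the unsigned range for the affine case, and the epsilon guard are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Uniform affine (asymmetric) quantization of a float tensor.

    Maps [x_min, x_max] onto the unsigned integer grid [0, 2^b - 1]
    using a scale (delta) and a zero-point, so that 0.0 is exactly
    representable.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min = min(float(x.min()), 0.0)  # the range must include zero
    x_max = max(float(x.max()), 0.0)
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)  # guard against zero range
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def symmetric_quantize(x, num_bits=8):
    """Uniform symmetric quantization: zero-point fixed at 0, signed
    integer grid [-(2^(b-1) - 1), 2^(b-1) - 1]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(x).max()) / qmax, 1e-8)
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def dequantize(x_q, scale, zero_point=0):
    """Recover the real-valued approximation: x_hat = (x_q - z) * scale."""
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(64).astype(np.float32)
x_q, s, z = affine_quantize(x)
print("affine max reconstruction error:", np.abs(dequantize(x_q, s, z) - x).max())
```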
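Quantization-aware training works by inserting "fake quantization" operations that round in the forward pass while the backward pass treats rounding as the identity (the straight-through estimator). The sketch below, building on the affine quantizer above, is an assumed minimal formulation rather than the exact ops used in the paper's experiments.

```python
import numpy as np

def fake_quantize(x, scale, zero_point, num_bits=8):
    """Simulated quantization used during training: quantize then
    immediately dequantize, so downstream layers see the rounding
    error while the tensor itself stays in floating point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (x_q - zero_point) * scale

def fake_quantize_grad(x, upstream_grad, scale, zero_point, num_bits=8):
    """Straight-through estimator: gradients pass through unchanged
    wherever x fell inside the representable (clamped) range, and are
    zeroed where x was clipped."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = (qmin - zero_point) * scale
    hi = (qmax - zero_point) * scale
    return upstream_grad * ((x >= lo) & (x <= hi))
```

The straight-through estimator is what makes training through a non-differentiable rounding step possible: the forward pass experiences quantization noise, while the backward pass still provides a useful descent direction.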
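Per-channel quantization gives each output channel of a weight tensor its own scale, while activations keep a single per-layer scale. A sketch under assumed conventions: a conv weight laid out as (out_channels, in_channels, kh, kw), quantized with the symmetric scheme typically used for weights; the helper name and layout are illustrative.

```python
import numpy as np

def per_channel_symmetric_quantize(w, num_bits=8):
    """Symmetric quantization with one scale per output channel.

    w: conv weights of shape (out_channels, in_channels, kh, kw).
    Returns int8 weights plus a vector of per-channel scales.
    """
    qmax = 2 ** (num_bits - 1) - 1
    # Max absolute value per output channel, reduced over all other axes.
    abs_max = np.abs(w).reshape(w.shape[0], -1).max(axis=1)
    scales = np.maximum(abs_max / qmax, 1e-8)  # avoid divide-by-zero
    w_q = np.clip(np.round(w / scales[:, None, None, None]), -qmax, qmax)
    return w_q.astype(np.int8), scales.astype(np.float32)

w = np.random.randn(32, 16, 3, 3).astype(np.float32)
w_q, s = per_channel_symmetric_quantize(w)
w_hat = w_q.astype(np.float32) * s[:, None, None, None]
print("per-channel max reconstruction error:", np.abs(w_hat - w).max())
```

Because each channel's scale tracks only that channel's dynamic range, outlier channels no longer inflate the quantization step for the whole tensor, which is why per-channel weight quantization narrows the accuracy gap to floating point.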