Quantizing deep convolutional networks for efficient inference: A whitepaper

June 2018 | Raghuraman Krishnamoorthi
This whitepaper discusses techniques for quantizing deep convolutional networks for efficient inference with integer weights and activations. It covers quantizer design, quantized inference performance and accuracy, training best practices, model architecture recommendations, run-time measurements, and recommendations for neural network accelerators.

The paper presents several quantization schemes, including the uniform affine quantizer, the uniform symmetric quantizer, and the stochastic quantizer, and discusses their impact on model accuracy and performance. It also examines post-training quantization and quantization-aware training, showing that quantization-aware training can significantly improve accuracy over post-training approaches. The paper highlights that per-channel quantization of weights and per-layer quantization of activations are preferred for hardware acceleration and kernel optimization, and it recommends that future processors and hardware accelerators support precisions of 4, 8, and 16 bits. Minimal sketches of these schemes are given below.

The paper concludes that quantizing deep convolutional networks can significantly reduce model size, improve inference speed, and lower power consumption, making it an essential technique for deploying deep learning models on edge devices.
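To make the two deterministic schemes concrete, here is a minimal NumPy sketch of the uniform affine and uniform symmetric quantizers for b-bit integers. The function names, the unsigned range for the affine case, and the epsilon guard are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Uniform affine (asymmetric) quantization of a float tensor.

    Maps [x_min, x_max] onto the unsigned integer grid [0, 2^b - 1]
    using a scale (delta) and a zero-point, so that 0.0 is exactly
    representable.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min = min(float(x.min()), 0.0)  # the range must include zero
    x_max = max(float(x.max()), 0.0)
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)  # guard against zero range
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def symmetric_quantize(x, num_bits=8):
    """Uniform symmetric quantization: zero-point fixed at 0, signed
    integer grid [-(2^(b-1) - 1), 2^(b-1) - 1]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(x).max()) / qmax, 1e-8)
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def dequantize(x_q, scale, zero_point=0):
    """Recover the real-valued approximation: x_hat = (x_q - z) * scale."""
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(64).astype(np.float32)
x_q, s, z = affine_quantize(x)
print("affine max reconstruction error:", np.abs(dequantize(x_q, s, z) - x).max())
```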
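Quantization-aware training works by inserting "fake quantization" operations that round in the forward pass while the backward pass treats rounding as the identity (the straight-through estimator). The sketch below, building on the affine quantizer above, is an assumed minimal formulation rather than the exact ops used in the paper's experiments.

```python
import numpy as np

def fake_quantize(x, scale, zero_point, num_bits=8):
    """Simulated quantization used during training: quantize then
    immediately dequantize, so downstream layers see the rounding
    error while the tensor itself stays in floating point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (x_q - zero_point) * scale

def fake_quantize_grad(x, upstream_grad, scale, zero_point, num_bits=8):
    """Straight-through estimator: gradients pass through unchanged
    wherever x fell inside the representable (clamped) range, and are
    zeroed where x was clipped."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = (qmin - zero_point) * scale
    hi = (qmax - zero_point) * scale
    return upstream_grad * ((x >= lo) & (x <= hi))
```

The straight-through estimator is what makes training through a non-differentiable rounding step possible: the forward pass experiences quantization noise, while the backward pass still provides a useful descent direction.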
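Per-channel quantization gives each output channel of a weight tensor its own scale, while activations keep a single per-layer scale. A sketch under assumed conventions: a conv weight laid out as (out_channels, in_channels, kh, kw), quantized with the symmetric scheme typically used for weights; the helper name and layout are illustrative.

```python
import numpy as np

def per_channel_symmetric_quantize(w, num_bits=8):
    """Symmetric quantization with one scale per output channel.

    w: conv weights of shape (out_channels, in_channels, kh, kw).
    Returns int8 weights plus a vector of per-channel scales.
    """
    qmax = 2 ** (num_bits - 1) - 1
    # Max absolute value per output channel, reduced over all other axes.
    abs_max = np.abs(w).reshape(w.shape[0], -1).max(axis=1)
    scales = np.maximum(abs_max / qmax, 1e-8)  # avoid divide-by-zero
    w_q = np.clip(np.round(w / scales[:, None, None, None]), -qmax, qmax)
    return w_q.astype(np.int8), scales.astype(np.float32)

w = np.random.randn(32, 16, 3, 3).astype(np.float32)
w_q, s = per_channel_symmetric_quantize(w)
w_hat = w_q.astype(np.float32) * s[:, None, None, None]
print("per-channel max reconstruction error:", np.abs(w_hat - w).max())
```

Because each channel's scale tracks only that channel's dynamic range, outlier channels no longer inflate the quantization step for the whole tensor, which is why per-channel weight quantization narrows the accuracy gap to floating point.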