15 Dec 2017 | Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko
This paper proposes a quantization scheme that enables efficient integer-only arithmetic inference for neural networks, which can be implemented more efficiently than floating-point inference on commonly available integer-only hardware. The scheme also includes a training procedure to preserve end-to-end model accuracy post-quantization. The proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
The paper introduces a quantization scheme that quantizes both weights and activations as 8-bit integers, and just a few parameters (bias vectors) as 32-bit integers. It provides a quantized inference framework that is efficiently implementable on integer-arithmetic-only hardware such as the Qualcomm Hexagon, and describes an efficient, accurate implementation on ARM NEON. It also provides a quantized training framework co-designed with the quantized inference to minimize the loss of accuracy from quantization on real models.
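To make the scheme concrete, its core idea is an affine mapping r ≈ S(q − Z) between a real value r and its quantized integer q, where the scale S is a positive real number and the zero-point Z is an integer of the same type as q. The sketch below (in NumPy, with illustrative function names not taken from the paper's code) shows one way the quantization parameters can be chosen, how tensors move between real and 8-bit representations, and how a layer can be computed with int32 accumulation and a quantized bias; a production integer-only kernel would replace the floating-point rescaling multiplier with a fixed-point one.

```python
import numpy as np

def choose_quant_params(r_min, r_max, num_bits=8):
    """Pick scale S and zero-point Z so that r ~= S * (q - Z) covers [r_min, r_max].
    The range is nudged to contain 0 so that zero is exactly representable."""
    qmin, qmax = 0, 2 ** num_bits - 1
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = max((r_max - r_min) / (qmax - qmin), 1e-8)
    zero_point = int(np.clip(round(qmin - r_min / scale), qmin, qmax))
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    """Real-valued tensor -> unsigned 8-bit tensor."""
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Unsigned 8-bit tensor -> approximate real-valued tensor."""
    return scale * (q.astype(np.int32) - zero_point)

def quantized_matmul(q_w, q_x, zp_w, zp_x, q_bias, scale_w, scale_x,
                     scale_out, zp_out):
    """One fully connected layer in integer arithmetic: products are accumulated
    in int32, the bias (pre-quantized with scale scale_w * scale_x and
    zero-point 0) is added in int32, and the result is rescaled to the 8-bit
    output range.  The float multiplier below stands in for the fixed-point
    multiplier a real integer-only kernel would use."""
    acc = (q_w.astype(np.int32) - zp_w) @ (q_x.astype(np.int32) - zp_x)
    acc += q_bias
    multiplier = scale_w * scale_x / scale_out
    out = np.round(multiplier * acc) + zp_out
    return np.clip(out, 0, 255).astype(np.uint8)
```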
The paper applies these frameworks to efficient classification and detection systems based on MobileNets and provides benchmark results on popular ARM CPUs that show significant improvements in the latency-vs-accuracy tradeoffs for state-of-the-art MobileNet architectures, demonstrated in ImageNet classification, COCO object detection, and other tasks.
The paper also discusses training with simulated quantization, which models the effects of quantization in the forward pass of training while backpropagation proceeds as usual. Training against these rounding effects lets the network adapt to the reduced precision and recovers accuracy to levels close to the floating-point baseline. The paper additionally describes batch normalization folding, which absorbs the batch normalization parameters into the preceding layer's weights and biases so that inference needs no separate normalization step.
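A minimal sketch of the two training-time mechanisms described above, again in NumPy with illustrative function names: fake_quantize snaps a float tensor onto the 8-bit grid in the forward pass (during backpropagation the rounding is treated as identity, a straight-through estimator), and fold_batch_norm absorbs the normalization statistics into the preceding layer's weights, assuming that layer carries no bias of its own.

```python
import numpy as np

def fake_quantize(r, r_min, r_max, num_bits=8):
    """Simulated quantization for the training forward pass: the tensor stays
    float, but its values are rounded to the grid that 8-bit inference will
    use.  In a training framework the gradient of this op is passed through
    unchanged (straight-through estimator)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)  # keep 0 exactly representable
    scale = max((r_max - r_min) / (qmax - qmin), 1e-8)
    zero_point = np.clip(round(-r_min / scale), qmin, qmax)
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)

def fold_batch_norm(weights, gamma, beta, running_mean, running_var, eps=1e-3):
    """Fold batch-norm parameters into the preceding conv/linear layer so that
    inference needs no separate normalization step.  `weights` has shape
    (out_channels, ...) and gamma/beta/mean/var are per output channel."""
    std = np.sqrt(running_var + eps)
    shape = (-1,) + (1,) * (weights.ndim - 1)
    folded_weights = weights * (gamma / std).reshape(shape)
    folded_bias = beta - gamma * running_mean / std
    return folded_weights, folded_bias
```

During quantization-aware training it is the folded weights, not the raw ones, that are fake-quantized, so the training graph sees the same numerics as the folded inference graph.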
The paper presents experiments that demonstrate the effectiveness of quantized training and the improved latency-vs-accuracy tradeoff of quantized models on common hardware. The experiments show that quantized models achieve higher accuracies than floating-point models given the same runtime budget, and that on some hardware the accuracy gap is quite substantial. The paper also reports results on COCO object detection and face detection, showing that quantized models achieve significant latency reductions while maintaining accuracy close to their floating-point counterparts. The paper concludes that integer-arithmetic-only inference could be a key enabler that propels visual recognition technologies into the real-time and low-end phone market.