15 Dec 2017 | Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko
The paper proposes a quantization scheme that allows neural network inference to be performed using only integer arithmetic, which is more efficient than floating-point inference on common integer-only hardware. The authors also co-design a training procedure that preserves end-to-end model accuracy after quantization. The resulting scheme improves the trade-off between accuracy and on-device latency, with significant gains even on MobileNets, a model family already known for its run-time efficiency. The improvements are demonstrated on ImageNet classification and COCO detection, benchmarked on popular CPUs.

The paper addresses the limitations of prior quantization work by starting from a meaningful, already-efficient baseline architecture and reporting verifiable efficiency improvements on real hardware. Concretely, both weights and activations are quantized to 8-bit integers, with only a few parameters (such as bias vectors) kept as 32-bit integers. The authors provide a quantized inference framework that is efficiently implementable on integer-arithmetic-only hardware, together with a quantized training framework that minimizes the accuracy loss from quantization. The results show that the scheme significantly reduces latency with little loss in accuracy, making it well suited to real-time applications on low-end mobile devices.
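To make the scheme concrete, below is a minimal sketch of the affine quantization mapping the paper builds on, r = S(q − Z), where q is an 8-bit unsigned integer, S is a real-valued scale, and Z is an integer zero-point chosen so that the real value 0 is exactly representable. The function names and the numpy-based implementation are illustrative, not the authors' code.

```python
import numpy as np

def choose_quant_params(rmin, rmax, num_bits=8):
    """Pick scale S and zero-point Z so that reals in [rmin, rmax]
    map onto the integer range [0, 2^num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Ensure the range contains 0.0 so that zero quantizes exactly
    # (important for zero-padding in convolutions).
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:          # degenerate all-zero range
        scale = 1.0
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    """r = S * (q - Z)  =>  q = round(r / S) + Z, clamped to [0, 255]."""
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Recover an approximate real value from the integer code."""
    return scale * (q.astype(np.float32) - zero_point)

# Round-trip example: the reconstruction error is bounded by ~scale/2.
w = np.random.randn(4, 4).astype(np.float32)
scale, zp = choose_quant_params(w.min(), w.max())
q = quantize(w, scale, zp)
print(np.abs(dequantize(q, scale, zp) - w).max())
```

At inference time, only the integer codes q flow through the network; the per-tensor scale and zero-point let matrix multiplications accumulate in 32-bit integers, which is why the small number of 32-bit parameters (biases) fits naturally into the same integer-only pipeline.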