25 Aug 2017 | Aojun Zhou*, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen
This paper introduces Incremental Network Quantization (INQ), a method for converting any pre-trained full-precision convolutional neural network (CNN) into a low-precision version whose weights are constrained to powers of two or zero, without loss of accuracy. Unlike existing approaches, which often suffer significant accuracy degradation, INQ relies on three interdependent operations: weight partition, group-wise quantization, and re-training. Weight partition divides the weights of each layer into two groups: the first group is quantized to form a low-precision base, while the second group keeps full precision and is re-trained to compensate for the quantization loss. These operations are applied iteratively to the latest re-trained group until all weights are quantized, so quantization and accuracy recovery proceed incrementally.

Extensive experiments on ImageNet with AlexNet, VGG-16, GoogLeNet, and ResNets show that INQ matches or improves upon the accuracy of the corresponding 32-bit floating-point models. For example, ResNet-18 with 4-bit, 3-bit, and even 2-bit ternary weights achieves accuracy similar to or better than its 32-bit baseline, and combining INQ with network pruning improves compression further. Overall, INQ delivers lossless quantization with 5-bit, 4-bit, 3-bit, and even 2-bit weights, outperforming existing methods in both compression and accuracy.

Because the quantized weights are powers of two or zero, multiplications can be replaced by binary bit-shift operations, making the resulting models efficient to run on hardware such as FPGAs. The method is implemented in Caffe, and the code is available at https://github.com/Zhouaojun/Incremental-Network-Quantization.
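To make the quantization step concrete, here is a minimal NumPy sketch of mapping weights to the set {0, ±2^n2, ..., ±2^n1} and of a magnitude-based weight partition that selects the next group to quantize. The names quantize_to_pow2 and inq_step, the nearest-value rounding rule, and the mask-based freezing are illustrative assumptions for this sketch, not the paper's exact Caffe implementation.

    import numpy as np

    def quantize_to_pow2(w, n1, n2):
        # Map each weight to the nearest value in {0, +/-2^n2, ..., +/-2^n1}.
        # Nearest-value rounding is a simplification of the paper's rule.
        candidates = np.array([0.0] + [2.0 ** k for k in range(n2, n1 + 1)])
        mag = np.abs(w)
        idx = np.argmin(np.abs(mag[..., None] - candidates), axis=-1)
        return np.sign(w) * candidates[idx]

    def inq_step(weights, quantized_mask, fraction, n1, n2):
        # One incremental step (sketch): pick the largest-magnitude weights
        # that are not yet quantized (magnitude-based partition is one of the
        # strategies described in the paper), quantize them, and freeze them
        # via the mask. The remaining full-precision weights would then be
        # re-trained, with gradients applied only where quantized_mask is False.
        scores = np.abs(weights).flatten()
        scores[quantized_mask.flatten()] = -np.inf   # exclude frozen weights
        k = min(int(fraction * weights.size), int((~quantized_mask).sum()))
        new_mask = np.zeros(weights.size, dtype=bool)
        new_mask[np.argsort(scores)[len(scores) - k:]] = True
        new_mask = new_mask.reshape(weights.shape)
        quantized = quantize_to_pow2(weights, n1, n2)
        return np.where(new_mask, quantized, weights), quantized_mask | new_mask

In the full method, a re-training pass over the still-full-precision weights would follow each such step, and the quantized fraction would grow according to the chosen schedule until every weight is frozen; since every nonzero quantized value is a power of two, a multiplication by such a weight reduces to a sign change plus a binary shift by the exponent.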