6 Apr 2019 | Kuan Wang*, Zhijian Liu*, Yujun Lin*, Ji Lin, and Song Han
HAQ: Hardware-Aware Automated Quantization with Mixed Precision is a framework that uses reinforcement learning (RL) to automatically determine the optimal bitwidth for each layer of a neural network. Instead of relying on proxy signals such as FLOPs or model size, HAQ feeds the RL agent direct feedback (latency and energy) from a hardware simulator, which lets it specialize the quantization policy for each combination of network and hardware architecture. Compared with fixed-bitwidth quantization, this yields significant efficiency gains with minimal accuracy loss: HAQ reduces latency by 1.4-1.95× and energy consumption by 1.9× while maintaining high accuracy.

The RL agent operates in a continuous action space, deciding bitwidths layer by layer based on hardware feedback, and the reward is derived from the accuracy of the quantized model after finetuning. Evaluated on a range of hardware architectures, including edge and cloud accelerators, HAQ outperforms conventional fixed-bitwidth methods. The learned policies also reveal that the optimal bitwidth allocation varies significantly across hardware architectures and resource constraints, offering insights into both neural network and hardware design. Its ability to adapt to different hardware constraints and optimize for latency, energy, or model size makes it a valuable tool for efficient deep learning deployment. The sketches below illustrate the main pieces of this pipeline.
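To turn the agent's continuous action into a usable bit assignment, each action a ∈ [0, 1] is rounded to a discrete bitwidth. Here is a minimal sketch of that mapping, assuming the linear rounding rule b = round(b_min − 0.5 + a·(b_max − b_min + 1)) from the paper; the function name and the illustrative 2-8 bit range are my own choices:

```python
def action_to_bitwidth(action: float, b_min: int = 2, b_max: int = 8) -> int:
    """Round a continuous action in [0, 1] to a discrete bitwidth.

    Each bitwidth in [b_min, b_max] gets an equal slice of the action
    space: b = round(b_min - 0.5 + a * (b_max - b_min + 1)).
    """
    b = round(b_min - 0.5 + action * (b_max - b_min + 1))
    return max(b_min, min(b_max, b))
```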
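Given a bitwidth, weights and activations are linearly quantized. The sketch below assumes symmetric linear quantization with a clipping threshold c (the paper selects c to minimize quantization error); `linear_quantize` and its signature are illustrative, not the paper's API:

```python
import torch

def linear_quantize(w: torch.Tensor, bits: int, c: float) -> torch.Tensor:
    """Symmetric linear quantization of a tensor to `bits` bits.

    Values are clamped to [-c, c], snapped to the nearest of the
    2^(bits-1) - 1 positive levels (and their negatives), and mapped
    back to floats: q = round(clamp(w, -c, c) / s) * s.
    """
    s = c / (2 ** (bits - 1) - 1)  # step size between quantization levels
    return torch.clamp(w, -c, c).div(s).round().mul(s)
```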
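Finally, the reward compares the finetuned quantized model's accuracy against the full-precision baseline, and the hardware budget is enforced by decreasing layer bitwidths until the simulator's cost estimate fits the constraint. A hedged sketch of both, where `cost_fn` (a stand-in for the simulator query) and the round-robin reduction order are my assumptions:

```python
def reward(acc_quant: float, acc_origin: float, lam: float = 0.1) -> float:
    """Scaled accuracy gap between the finetuned quantized model and the
    original full-precision model (the paper uses a small scaling factor)."""
    return lam * (acc_quant - acc_origin)

def enforce_budget(bitwidths, cost_fn, budget, b_min=2):
    """Sequentially decrease layer bitwidths until the estimated cost
    (latency or energy from the hardware simulator) fits the budget.

    `cost_fn` stands in for a hardware-simulator query; the round-robin
    reduction order is one plausible reading of the paper's procedure.
    """
    bitwidths = list(bitwidths)
    i = 0
    while cost_fn(bitwidths) > budget and any(b > b_min for b in bitwidths):
        if bitwidths[i % len(bitwidths)] > b_min:
            bitwidths[i % len(bitwidths)] -= 1
        i += 1
    return bitwidths
```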