HAQ: Hardware-Aware Automated Quantization with Mixed Precision


6 Apr 2019 | Kuan Wang*, Zhijian Liu*, Yujun Lin*, Ji Lin, and Song Han
The paper introduces the Hardware-Aware Automated Quantization (HAQ) framework, which uses reinforcement learning to automatically determine the optimal bitwidth for each layer of a deep neural network (DNN) for mixed-precision inference. HAQ addresses the challenge of choosing per-layer bitwidths by navigating the trade-offs among accuracy, latency, energy, and model size. Unlike conventional quantization methods that apply a single fixed bitwidth to all layers, HAQ employs a hardware simulator to generate direct feedback signals (latency and energy) for the reinforcement learning agent, allowing it to explore the vast design space more effectively. The framework is fully automated and can specialize the quantization policy for different neural network and hardware architectures. Experiments on MobileNet-V1 and MobileNet-V2 show that HAQ reduces latency by 1.4-1.95× and energy consumption by 1.9× with minimal accuracy loss compared to fixed-bitwidth quantization. The framework also reveals that optimal quantization policies vary significantly across hardware architectures, providing valuable insights for both neural network and hardware architecture design.
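To make the mechanics concrete, here is a minimal, hypothetical PyTorch sketch of the two steps the summary describes: mapping the agent's continuous action to a per-layer bitwidth, and linearly quantizing that layer's weights before collecting accuracy and simulated hardware feedback. The function names (`action_to_bitwidth`, `linear_quantize`, `evaluate_policy`), the action-to-bitwidth rounding rule, and the 2-8 bit range are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def linear_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric linear quantization of a weight tensor to `bits` bits,
    # then dequantization back to float (a common way to simulate
    # low-precision inference during evaluation).
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax  # map largest |w| to qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

def action_to_bitwidth(action: float, b_min: int = 2, b_max: int = 8) -> int:
    # Map a continuous action in [0, 1] to a discrete bitwidth.
    # Hypothetical rounding rule; the paper's exact mapping may differ.
    return int(round(b_min + action * (b_max - b_min)))

def evaluate_policy(layers, actions, simulate_hardware, measure_accuracy):
    # Apply one candidate quantization policy and collect the feedback
    # signals the RL agent would learn from: task accuracy plus direct
    # latency/energy numbers from a hardware simulator (both passed in
    # as caller-supplied stubs here).
    bitwidths = [action_to_bitwidth(a) for a in actions]
    for layer, b in zip(layers, bitwidths):
        with torch.no_grad():
            layer.weight.copy_(linear_quantize(layer.weight, b))
    latency, energy = simulate_hardware(bitwidths)
    accuracy = measure_accuracy(layers)
    return accuracy, latency, energy
```

In the full framework the agent (the paper uses an actor-critic RL agent) is trained to maximize accuracy under hardware constraints; this sketch only shows how a single candidate policy would be applied to the layers and scored against the simulator's feedback.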