This paper proposes PACT, a novel activation quantization technique that enables neural networks to operate with ultra-low-precision weights and activations without significant accuracy degradation. PACT introduces a parameterized clipping level, α, which is optimized during training to find an appropriate quantization scale. This technique allows activations to be quantized to arbitrary bit precisions while achieving better accuracy than existing state-of-the-art quantization schemes. The authors show that both weights and activations can be quantized to 4 bits of precision while maintaining accuracy comparable to full-precision networks across a range of models and datasets. They also demonstrate that using reduced-precision compute units in hardware can yield super-linear improvements in inference performance, owing to the reduced area of accelerator compute engines and the ability to retain quantized model and activation data in on-chip memories.
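To make the mechanism concrete, the clip-then-quantize step with a trainable α can be sketched as below. This is a minimal illustration under my own assumptions (not the authors' code), written as a PyTorch-style autograd function with a straight-through estimator; names such as PACTQuantize are hypothetical.

```python
# Minimal sketch of PACT-style activation quantization (illustrative, not the paper's code).
# PACT replaces ReLU with y = clip(x, 0, alpha) = 0.5(|x| - |x - alpha| + alpha), where
# alpha is learned, then quantizes y to k bits. Gradients use a straight-through estimator.
import torch


class PACTQuantize(torch.autograd.Function):
    """Clip activations to [0, alpha] and quantize to k bits, with STE gradients."""

    @staticmethod
    def forward(ctx, x, alpha, k):
        ctx.save_for_backward(x, alpha)
        scale = (2 ** k - 1) / alpha
        y = torch.clamp(x, min=0.0, max=alpha.item())   # clipping activation
        y_q = torch.round(y * scale) / scale             # uniform k-bit levels in [0, alpha]
        return y_q

    @staticmethod
    def backward(ctx, grad_output):
        x, alpha = ctx.saved_tensors
        # STE for x: pass gradients only inside the clipping range (0, alpha)
        grad_x = grad_output * ((x > 0) & (x < alpha)).float()
        # Gradient w.r.t. alpha: 1 where the input was clipped (x >= alpha), 0 elsewhere
        grad_alpha = (grad_output * (x >= alpha).float()).sum().view(1)
        return grad_x, grad_alpha, None


# Usage sketch: alpha is a trainable parameter, initialized to a large value and
# (as described in the paper) regularized with weight decay so it does not grow unbounded.
alpha = torch.nn.Parameter(torch.tensor([10.0]))
x = torch.randn(8, 16)
y_q = PACTQuantize.apply(x, alpha, 4)   # 4-bit activations
```

Because α is a parameter of the loss, training balances the clipping error (α too small) against the quantization error (α too large), which is how PACT finds the quantization scale automatically.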
The main contributions of this work include: (1) PACT, a new activation quantization scheme that automatically optimizes the quantization scale during model training; (2) quantitative results demonstrating PACT's effectiveness across a range of models and datasets; and (3) a system performance analysis of the trade-offs between hardware complexity at different bit representations and model accuracy. The paper also compares PACT with existing quantization schemes, showing that PACT incurs lower accuracy degradation for both weights and activations, even at very low bit precisions. The results demonstrate that PACT reaches near-full-precision accuracy with 4-bit precision for both weights and activations, the lowest bit precision reported to achieve accuracy this close to full precision. Additionally, the paper shows that reduced-precision MAC units can significantly improve overall system performance by allowing more accelerator cores to fit in the same area.