MixDQ is a novel mixed-precision quantization method designed to reduce the memory footprint and inference cost of few-step text-to-image diffusion models. These models generate high-quality images from textual prompts in only a few denoising steps, but their substantial memory consumption (5-10 GB) limits practical deployment on mobile devices. Post-Training Quantization (PTQ) is an effective way to cut memory and computational costs, yet existing PTQ methods struggle to preserve visual quality and text alignment when applied to few-step diffusion models.
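To ground what PTQ does, here is a minimal sketch of symmetric per-tensor post-training quantization in PyTorch. The function names are illustrative, not MixDQ's actual API; calibration here is simply a single pass over the tensor to pick a scale.

```python
import torch

def ptq_quantize(w: torch.Tensor, n_bits: int = 8):
    """Symmetric per-tensor post-training quantization.

    The scale maps the largest weight magnitude onto the top of the
    signed integer range; no retraining is involved.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_int = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return w_int, scale

def ptq_dequantize(w_int: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover approximate float weights for simulated-quantization inference.
    return w_int * scale

w = torch.randn(512, 512)
w_int, scale = ptq_quantize(w, n_bits=4)
print((w - ptq_dequantize(w_int, scale)).abs().max())  # worst-case error
```

Few-step diffusion models are unusually sensitive to the rounding error this introduces, which motivates the targeted techniques below.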
MixDQ identifies that quantization is bottlenecked by a few highly sensitive layers, particularly those processing the text embedding. It introduces a specialized Begin-Of-Sentence (BOS)-aware quantization to handle these layers: the BOS token embedding is a large-magnitude outlier that is identical across prompts, so it can be precomputed and kept in full precision, which sharply reduces quantization error for the remaining tokens. Additionally, MixDQ proposes a metric-decoupled sensitivity analysis that separately estimates the impact of quantizing each layer on image quality and on image-text alignment, allowing for more precise bit-width allocation. An integer-programming-based method then optimizes the mixed-precision configuration, achieving significant memory savings and latency improvements.
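The BOS-aware idea can be sketched as follows. This is a simplified illustration, assuming the BOS token sits at position 0 of the text embedding; a real implementation would cache the full-precision BOS features offline rather than branching at runtime.

```python
import torch

def bos_aware_quantize(text_emb: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Fake-quantize a text embedding [batch, tokens, dim], sparing the BOS token.

    Excluding the outlier BOS embedding keeps the quantization scale
    tight for the remaining, much smaller-magnitude tokens.
    """
    bos, rest = text_emb[:, :1], text_emb[:, 1:]
    qmax = 2 ** (n_bits - 1) - 1
    scale = rest.abs().max() / qmax
    rest_q = torch.clamp(torch.round(rest / scale), -qmax, qmax) * scale
    # Full-precision BOS is concatenated back with the quantized rest.
    return torch.cat([bos, rest_q], dim=1)
```

The bit-width allocation, in turn, can be posed as a small integer program: binary variables pick one bit-width per layer, minimizing total measured sensitivity under a memory budget. The sketch below uses made-up sensitivity numbers and layer sizes, with SciPy's MILP solver standing in for whatever solver the authors used.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

bits = np.array([2, 4, 8])                       # candidate bit-widths
sensitivity = np.array([[0.90, 0.30, 0.05],      # quality drop per layer/bit-width
                        [0.20, 0.05, 0.01],      # (hypothetical calibration data)
                        [1.50, 0.60, 0.10]])
params = np.array([1e6, 4e6, 2e6])               # parameters per layer
budget = 4.0 * params.sum()                      # total bits for a 4-bit average

L, K = sensitivity.shape
c = sensitivity.ravel()                          # objective: total sensitivity
one_choice = np.kron(np.eye(L), np.ones(K))      # each layer picks one bit-width
mem = (params[:, None] * bits[None, :]).ravel()  # storage cost of each choice

res = milp(c=c,
           constraints=[LinearConstraint(one_choice, 1, 1),
                        LinearConstraint(mem[None, :], 0, budget)],
           integrality=np.ones(L * K),
           bounds=Bounds(0, 1))
print("per-layer bit-widths:", bits[res.x.reshape(L, K).argmax(axis=1)])
```

Because sensitivities are measured per metric and per layer, the program can spend its bit budget where image quality or text alignment would suffer most.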
In experiments, MixDQ achieves W3.66A16 (weight-only) and W4A8 quantization with minimal degradation in visual quality and text alignment relative to the FP16 baseline. It reduces model size and memory costs by 3-4×, consistent with compressing 16-bit weights to roughly 4 bits, and achieves a 1.5× latency speedup on NVIDIA GPUs. Unlike existing PTQ methods, it also preserves image-text alignment, so generated images continue to follow the textual instructions.
MixDQ's contributions are threefold: identifying the challenges of quantizing few-step diffusion models, designing a specialized quantization method to address them, and evaluating that method in both weight-only and joint weight-activation quantization settings. It demonstrates superior fidelity and efficiency compared to existing quantization techniques, making it a valuable tool for future research and applications in generative models.