6 Jun 2024 | Yang Sui, Yanyu Li, Anil Kag, Yerlan Idelbayev, Junli Cao, Ju Hu, Dhritiman Sagar, Bo Yuan, Sergey Tulyakov, Jian Ren
BitsFusion is a novel weight quantization method that compresses the UNet of Stable Diffusion v1.5 to 1.99 bits on average, yielding a 7.9× smaller model while maintaining or even improving generation quality. The method assigns an optimal bit width to each layer based on an analysis of its quantization error, initializes the quantized model to improve downstream performance, and refines the training strategy to further reduce quantization error. Training follows a two-stage pipeline: Stage I distills the quantized model from a full-precision teacher, and Stage II fine-tunes it with the standard noise-prediction objective. A quantization error-aware time step sampling strategy further improves performance. Extensive benchmark evaluations and human studies show that the 1.99-bit model outperforms the full-precision model in generation quality and text-image alignment. The work also addresses broader challenges in quantizing large-scale diffusion models, including the need for fair evaluation and the effectiveness of low-bit quantization, demonstrating that BitsFusion achieves substantial storage savings while preserving high-quality image generation.
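The per-layer bit assignment described above can be illustrated with a minimal sketch: measure each layer's quantization error at several candidate bit widths, then greedily spend a global bit budget on the most sensitive layers. The uniform quantizer, the MSE error proxy, and the greedy allocation below are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric quantizer (an illustrative stand-in)."""
    if bits == 1:
        # Treat 1-bit as sign quantization with a single per-tensor scale.
        return torch.sign(w) * w.abs().mean()
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-12
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def assign_bits(layers: dict, candidate_bits=(1, 2, 3, 4),
                avg_budget: float = 1.99) -> dict:
    """Greedy mixed-precision assignment under an average-bit budget.

    Start every layer at the lowest width, then repeatedly upgrade the
    layer whose quantization error (MSE vs. full precision) shrinks the
    most, until the parameter-weighted average reaches the budget.
    """
    bits = {name: candidate_bits[0] for name in layers}
    total = sum(w.numel() for w in layers.values())

    def mse(name, b):
        w = layers[name]
        return torch.mean((w - quantize(w, b)) ** 2).item()

    def avg_bits():
        return sum(bits[n] * layers[n].numel() for n in layers) / total

    while avg_bits() < avg_budget:
        # Candidate upgrades: (error reduction, layer name).
        upgrades = [(mse(n, bits[n]) - mse(n, bits[n] + 1), n)
                    for n in layers if bits[n] < max(candidate_bits)]
        if not upgrades:
            break
        _, name = max(upgrades)
        bits[name] += 1
    return bits

# Toy demo: four random "layers" with different weight scales.
torch.manual_seed(0)
demo = {f"layer{i}": torch.randn(256, 256) * (0.05 * (i + 1)) for i in range(4)}
print(assign_bits(demo))  # more sensitive layers end up with more bits
```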
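The two-stage objective is stated directly in the summary and can be written compactly: Stage I matches the quantized student's noise prediction to a frozen full-precision teacher, and Stage II switches to the ordinary noise-prediction loss against the true injected noise. The MSE loss shapes are standard diffusion-training components; the `student`/`teacher` call signature is an assumed UNet-like interface.

```python
import torch
import torch.nn.functional as F

def stage1_distillation_loss(student, teacher, x_t, t, text_emb):
    """Stage I: match the quantized student's noise prediction to a
    frozen full-precision teacher on the same noisy latent."""
    with torch.no_grad():
        eps_teacher = teacher(x_t, t, text_emb)
    return F.mse_loss(student(x_t, t, text_emb), eps_teacher)

def stage2_noise_prediction_loss(student, x_t, t, text_emb, eps):
    """Stage II: fine-tune the quantized model with the standard
    noise-prediction objective against the true injected noise."""
    return F.mse_loss(student(x_t, t, text_emb), eps)
```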
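Finally, the quantization error-aware time step sampling can be sketched as drawing training time steps in proportion to a measured per-step error rather than uniformly. The error proxy and the proportional weighting here are assumptions for illustration.

```python
import numpy as np

def make_timestep_sampler(step_errors: np.ndarray, seed: int = 0):
    """Build a sampler that draws diffusion time steps with probability
    proportional to a measured per-step quantization error (e.g., an
    assumed proxy such as the MSE between quantized and full-precision
    noise predictions at each step)."""
    probs = step_errors / step_errors.sum()
    rng = np.random.default_rng(seed)

    def sample(batch_size: int) -> np.ndarray:
        return rng.choice(len(probs), size=batch_size, p=probs)

    return sample

# Illustrative error profile over 1000 steps; real values would be measured.
errors = np.linspace(2.0, 0.5, num=1000)
sample_t = make_timestep_sampler(errors)
print(sample_t(8))  # steps with larger error are drawn more often
```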