Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

16 Feb 2024 | Lanqing Guo¹,²†, Yingqing He²,³†, Haoxin Chen², Menghan Xia², Xiaodong Cun², Yufei Wang¹, Siyu Huang⁴, Yong Zhang²*, Xintao Wang², Qifeng Chen³, Ying Shan², Bihan Wen¹*
This paper proposes a self-cascade diffusion model for efficient higher-resolution adaptation. The model leverages a well-trained low-resolution diffusion model to adapt rapidly to higher resolutions with minimal computational cost. It introduces a pivot-guided noise re-scheduling strategy for tuning-free adaptation, and a series of time-aware feature upsampler modules for lightweight fine-tuning. The self-cascade design adapts efficiently to higher resolutions while preserving the original model's composition and generation capabilities. Compared with full fine-tuning, the approach achieves a 5× training speedup and requires only 0.002M additional tunable parameters. Extensive experiments show that the method adapts to higher-resolution image and video synthesis with just 10k fine-tuning steps and virtually no extra inference time. The module can be plugged into any pre-trained diffusion-based synthesis model, for both image and video generation, and achieves state-of-the-art performance in both tuning-free and fine-tuning settings across various scale adaptations. Its efficiency, light weight, and scalability make it well suited for high-resolution image and video generation.
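
To make the tuning-free path concrete, the sketch below illustrates one plausible reading of pivot-guided noise re-scheduling: a sample from the well-trained low-resolution model is upsampled to the target size and then re-noised to an intermediate "pivot" timestep, so the reverse diffusion process can resume at the higher resolution. This is a minimal illustration using the standard diffusion forward formula; the function name, arguments, and the choice of bilinear upsampling are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def pivot_guided_reschedule(x_lowres: torch.Tensor,
                            alphas_cumprod: torch.Tensor,
                            t_pivot: int,
                            scale: int = 2) -> torch.Tensor:
    """Sketch of pivot-guided noise re-scheduling (tuning-free adaptation).

    x_lowres:       a clean sample from the low-resolution model, shape (B, C, H, W)
    alphas_cumprod: cumulative noise schedule \bar{alpha}_t of the diffusion model
    t_pivot:        intermediate timestep at which high-resolution denoising resumes
    """
    # Upsample the low-resolution pivot sample to the target resolution.
    x_up = F.interpolate(x_lowres, scale_factor=scale,
                         mode="bilinear", align_corners=False)
    # Re-noise it to t_pivot with the standard closed-form forward process:
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    a_bar = alphas_cumprod[t_pivot]
    eps = torch.randn_like(x_up)
    x_t = a_bar.sqrt() * x_up + (1.0 - a_bar).sqrt() * eps
    # The same (frozen) denoiser then runs the reverse process from t_pivot
    # at the higher resolution.
    return x_t
```

In the fine-tuning setting, the paper additionally inserts lightweight time-aware feature upsampler modules into the frozen backbone; their tiny parameter count (0.002M) is what keeps adaptation cheap relative to full fine-tuning.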