This paper introduces a novel self-cascade diffusion model designed to efficiently adapt pre-trained low-resolution diffusion models for higher-resolution image and video generation. The model leverages the rich knowledge learned by a well-trained low-resolution model to rapidly adapt to higher resolutions, avoiding the extensive fine-tuning and computational cost that full retraining would require. The key contributions include:
1. **Pivot-Guided Noise Re-Schedule**: This strategy cyclically re-uses the low-resolution model to guide the synthesis of detailed structures at higher resolutions, preserving semantic and local structural details.
2. **Time-Aware Feature Upsampler**: A series of lightweight, learnable modules are introduced to incorporate side information from high-quality images during fine-tuning, achieving a 5× training speed-up with only 0.002M additional parameters.
3. **Efficiency and Robustness**: The proposed method achieves state-of-the-art performance in both tuning-free and fine-tuning settings, demonstrating superior object composition and local structure generation capabilities.
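The pivot-guided idea of resuming denoising from an upsampled low-resolution result, rather than from pure noise, can be sketched with the standard DDPM forward-noising formula. The code below is a minimal illustration, not the paper's implementation; the function names and the nearest-neighbor upsampling are assumptions for clarity.

```python
import numpy as np

def upsample_nearest(x, scale=2):
    # Nearest-neighbor upsampling of an (H, W, C) array (illustrative choice).
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)

def pivot_guided_renoise(pivot_lr, alpha_bar_t, scale=2, rng=None):
    """Re-noise an upsampled low-resolution pivot to an intermediate
    diffusion timestep t, so denoising can resume at high resolution
    from a semantically faithful starting point instead of pure noise.

    `alpha_bar_t` is the cumulative schedule value at t (DDPM convention:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps).
    """
    rng = rng or np.random.default_rng(0)
    x0_hr = upsample_nearest(pivot_lr, scale)   # treat pivot as an x_0 estimate
    eps = rng.standard_normal(x0_hr.shape)      # fresh Gaussian noise
    return np.sqrt(alpha_bar_t) * x0_hr + np.sqrt(1.0 - alpha_bar_t) * eps
```

Note that at `alpha_bar_t = 1.0` (no noise) the output is exactly the upsampled pivot, and as `alpha_bar_t` decreases the high-resolution model has more freedom to synthesize new detail.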
Experiments on image and video synthesis datasets (LAION-5B and WebVid-10M) show that the method adapts to higher resolutions with minimal fine-tuning steps, achieving results comparable to or better than full fine-tuning. The code for the proposed method is available at <https://github.com/GuoLanqing/Self-Cascade/>.
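A lightweight time-aware feature upsampler of the kind described above can be sketched as follows. This is a hypothetical NumPy illustration under assumptions of my own (sinusoidal timestep embedding, per-channel scale-and-shift modulation); the class and weight names are not from the paper. It shows why such a plugin can stay tiny: the time-conditioned projections `W_s` and `W_b` are the only new parameters.

```python
import numpy as np

def timestep_embedding(t, dim=8):
    # Sinusoidal timestep embedding, as commonly used in diffusion U-Nets.
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

class TimeAwareUpsampler:
    """Hypothetical sketch: upsample a feature map from the frozen
    low-resolution model, then modulate it with a time-conditioned
    per-channel scale and shift learned during brief fine-tuning."""

    def __init__(self, channels, emb_dim=8, rng=None):
        rng = rng or np.random.default_rng(0)
        # The only trainable parameters in this sketch.
        self.W_s = 0.01 * rng.standard_normal((emb_dim, channels))
        self.W_b = 0.01 * rng.standard_normal((emb_dim, channels))

    def __call__(self, feat, t, scale=2):
        # feat: (H, W, C) feature map; t: integer diffusion timestep.
        up = np.repeat(np.repeat(feat, scale, axis=0), scale, axis=1)
        emb = timestep_embedding(t, self.W_s.shape[0])
        s = 1.0 + emb @ self.W_s    # per-channel scale, near identity at init
        b = emb @ self.W_b          # per-channel shift
        return up * s + b
```

Initializing the modulation near the identity (scale ≈ 1, shift ≈ 0) lets the frozen backbone's behavior dominate early in fine-tuning, which is one plausible way such a module could converge quickly.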