This paper introduces a novel self-cascade diffusion model designed to efficiently adapt pre-trained low-resolution diffusion models for higher-resolution image and video generation. The model leverages the rich knowledge learned by a well-trained low-resolution model to rapidly adapt to higher resolutions, avoiding the extensive fine-tuning and computational cost that full retraining would require. The key contributions include:
1. **Pivot-Guided Noise Re-Schedule**: This strategy cyclically re-uses the low-resolution model to guide the synthesis of detailed structures at higher resolutions, preserving semantic and local structural details.
2. **Time-Aware Feature Upsampler**: A series of lightweight, learnable modules are introduced to incorporate side information from high-quality images during fine-tuning, achieving a 5× training speed-up with only 0.002M additional parameters.
3. **Efficiency and Robustness**: The proposed method achieves state-of-the-art performance in both tuning-free and fine-tuning settings, demonstrating superior object composition and local structure generation capabilities.
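The pivot-guided idea of resuming denoising from an upsampled low-resolution result, rather than from pure noise, can be sketched with the standard DDPM forward-noising formula. The code below is a minimal illustration, not the paper's implementation; the function names and the nearest-neighbor upsampling are assumptions for clarity.

```python
import numpy as np

def upsample_nearest(x, scale=2):
    # Nearest-neighbor upsampling of an (H, W, C) array (illustrative choice).
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)

def pivot_guided_renoise(pivot_lr, alpha_bar_t, scale=2, rng=None):
    """Re-noise an upsampled low-resolution pivot to an intermediate
    diffusion timestep t, so denoising can resume at high resolution
    from a semantically faithful starting point instead of pure noise.

    `alpha_bar_t` is the cumulative schedule value at t (DDPM convention:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps).
    """
    rng = rng or np.random.default_rng(0)
    x0_hr = upsample_nearest(pivot_lr, scale)   # treat pivot as an x_0 estimate
    eps = rng.standard_normal(x0_hr.shape)      # fresh Gaussian noise
    return np.sqrt(alpha_bar_t) * x0_hr + np.sqrt(1.0 - alpha_bar_t) * eps
```

Note that at `alpha_bar_t = 1.0` (no noise) the output is exactly the upsampled pivot, and as `alpha_bar_t` decreases the high-resolution model has more freedom to synthesize new detail.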
Experiments on image and video synthesis datasets (LAION-5B and WebVid-10M) show that the method adapts to higher resolutions with minimal fine-tuning steps, achieving results comparable to or better than full fine-tuning. The code for the proposed method is available at <https://github.com/GuoLanqing/Self-Cascade/>.
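A lightweight time-aware feature upsampler of the kind described above can be sketched as follows. This is a hypothetical NumPy illustration under assumptions of my own (sinusoidal timestep embedding, per-channel scale-and-shift modulation); the class and weight names are not from the paper. It shows why such a plugin can stay tiny: the time-conditioned projections `W_s` and `W_b` are the only new parameters.

```python
import numpy as np

def timestep_embedding(t, dim=8):
    # Sinusoidal timestep embedding, as commonly used in diffusion U-Nets.
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

class TimeAwareUpsampler:
    """Hypothetical sketch: upsample a feature map from the frozen
    low-resolution model, then modulate it with a time-conditioned
    per-channel scale and shift learned during brief fine-tuning."""

    def __init__(self, channels, emb_dim=8, rng=None):
        rng = rng or np.random.default_rng(0)
        # The only trainable parameters in this sketch.
        self.W_s = 0.01 * rng.standard_normal((emb_dim, channels))
        self.W_b = 0.01 * rng.standard_normal((emb_dim, channels))

    def __call__(self, feat, t, scale=2):
        # feat: (H, W, C) feature map; t: integer diffusion timestep.
        up = np.repeat(np.repeat(feat, scale, axis=0), scale, axis=1)
        emb = timestep_embedding(t, self.W_s.shape[0])
        s = 1.0 + emb @ self.W_s    # per-channel scale, near identity at init
        b = emb @ self.W_b          # per-channel shift
        return up * s + b
```

Initializing the modulation near the identity (scale ≈ 1, shift ≈ 0) lets the frozen backbone's behavior dominate early in fine-tuning, which is one plausible way such a module could converge quickly.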