13 Jun 2024 | Qihao Liu1,2*, Zhanpeng Zeng1,3*, Ju He1,2*, Qihang Yu1, Xiaohui Shen1, Liang-Chieh Chen1
This paper introduces DiMR (Multi-Resolution Diffusion Model), an approach that enhances diffusion models by combining a multi-resolution network with time-dependent layer normalization. The primary goal is to reduce distortion and improve visual fidelity in image generation. Diffusion models have shifted from convolutional U-Net backbones toward Transformer-based designs, where a trade-off arises between visual fidelity and computational complexity: the cost of self-attention grows quadratically with the number of tokens, so representing an image at finer granularity quickly becomes expensive. DiMR addresses this with a feature cascade, in which features are progressively refined from low to high resolution, so that fine-grained details are captured while the computation stays manageable.
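The quadratic cost mentioned above can be made concrete with a little arithmetic. The sketch below is only an illustration: the feature-map side of 32 and the patch sizes are hypothetical values chosen for the example, not settings taken from the paper, and the function names are made up here.

```python
def num_tokens(feature_side: int, patch_size: int) -> int:
    """Number of tokens when a square feature map is split into square patches."""
    if feature_side % patch_size != 0:
        raise ValueError("feature_side must be divisible by patch_size")
    per_side = feature_side // patch_size
    return per_side * per_side


def attention_pairs(feature_side: int, patch_size: int) -> int:
    """Number of query-key pairs scored by one full self-attention layer."""
    n = num_tokens(feature_side, patch_size)
    return n * n


# Hypothetical 32x32 feature map, with progressively finer patches.
for patch in (4, 2, 1):
    print(patch, num_tokens(32, patch), attention_pairs(32, patch))
```

Under these assumptions, halving the patch size multiplies the token count by 4 and the number of attention pairs by 16, which is why finer detail is costly for a pure Transformer backbone.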
The multi-resolution network consists of multiple branches, each handling a different resolution. The first branch processes the lowest resolution with Transformer blocks, while the higher-resolution branches use ConvNeXt blocks, which are more efficient on high-resolution features. In this cascade, lower-resolution features are upsampled and refined at the next higher resolution, which reduces image distortion.
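The data flow of such a cascade can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's implementation: the two branch functions are placeholders that merely scale their input, features are combined by addition, and the pooling and upsampling choices are picked for simplicity. All names are hypothetical.

```python
def avg_pool2(x):
    """Halve height and width by averaging each 2x2 block (even sizes only)."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def upsample2(x):
    """Double height and width by nearest-neighbour repetition."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D grids."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def transformer_branch(x):
    # Placeholder standing in for Transformer blocks (assumption: just scales).
    return [[0.5 * v for v in row] for row in x]


def convnext_branch(x):
    # Placeholder standing in for ConvNeXt blocks (assumption: just scales).
    return [[0.5 * v for v in row] for row in x]


def feature_cascade(image, num_branches=3):
    """Process the coarsest input first, then refine at each finer resolution."""
    pyramid = [image]  # finest resolution first, coarsest last
    for _ in range(num_branches - 1):
        pyramid.append(avg_pool2(pyramid[-1]))
    features = transformer_branch(pyramid[-1])
    for level in reversed(pyramid[:-1]):
        features = convnext_branch(add(level, upsample2(features)))
    return features


image = [[1.0] * 8 for _ in range(8)]
output = feature_cascade(image)
print(len(output), len(output[0]))
```

The point of the sketch is the ordering: the expensive branch only ever sees the smallest grid, and each later branch receives the upsampled result of the previous one together with its own higher-resolution input.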
Additionally, the paper proposes Time-Dependent Layer Normalization (TD-LN), a parameter-efficient method that builds time-dependent parameters directly into layer normalization and improves the performance of the diffusion model. TD-LN is designed to be more flexible and parameter-efficient than adaptive layer normalization (AdaLN-Zero), which relies on a parameter-heavy MLP.
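To give a feel for why folding time dependence into the normalization itself can be cheap, here is a minimal sketch. The exact TD-LN formula is not given in this summary, so the form below is an assumption: the scale and shift are interpolated between two learnable parameter sets, with a sigmoid of the timestep as the mixing weight. The comparison function likewise assumes a single linear layer that maps a dim-sized embedding to one scale and one shift, which is a simplification of what AdaLN-style modulation uses. All names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class TimeDependentLayerNorm:
    """Assumed form: gamma(t) and beta(t) blend two learnable parameter sets."""

    def __init__(self, dim):
        self.dim = dim
        self.gamma1 = [1.0] * dim
        self.gamma2 = [1.0] * dim
        self.beta1 = [0.0] * dim
        self.beta2 = [0.0] * dim
        self.w = 1.0  # scalar weight of the time gate (assumption)
        self.b = 0.0  # scalar bias of the time gate (assumption)

    def num_params(self):
        # Two scale vectors, two shift vectors, plus the two gate scalars.
        return 4 * self.dim + 2

    def __call__(self, x, t):
        s = 1.0 / (1.0 + math.exp(-(self.w * t + self.b)))  # mixing weight in (0, 1)
        out = []
        for i, v in enumerate(layer_norm(x)):
            gamma = s * self.gamma1[i] + (1.0 - s) * self.gamma2[i]
            beta = s * self.beta1[i] + (1.0 - s) * self.beta2[i]
            out.append(gamma * v + beta)
        return out


def linear_modulation_params(dim):
    """Weights and biases of one linear layer mapping dim inputs to a scale and a shift."""
    return dim * (2 * dim) + 2 * dim


td_ln = TimeDependentLayerNorm(768)
print(td_ln.num_params(), linear_modulation_params(768))
```

Under these assumptions the time-dependent normalization grows linearly with the feature width, while the linear-layer modulation grows quadratically, which matches the summary's description of AdaLN-Zero as parameter-heavy.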
The effectiveness of DiMR is demonstrated through extensive experiments on the ImageNet dataset, achieving state-of-the-art performance in class-conditional image generation at resolutions of 64 × 64, 256 × 256, and 512 × 512. DiMR outperforms existing diffusion models, including U-ViT and DiT, with significantly fewer parameters and computational resources. The paper also includes ablation studies to validate the contributions of each component in the DiMR architecture.
Overall, DiMR provides a robust solution for generating high-fidelity images with reduced distortion, making it a significant advancement in the field of image generation.