Scaling Diffusion Transformers to 16 Billion Parameters

2024-07-16 | Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
This paper presents DiT-MoE, a sparse version of the diffusion Transformer that scales well, remains competitive with dense networks, and offers highly optimized inference. DiT-MoE incorporates two simple designs: shared expert routing and an expert-level balance loss, which help capture common knowledge and reduce redundancy among the routed experts.

When applied to conditional image generation, a detailed analysis of expert specialization reveals that expert selection shows a preference for spatial position and denoising time step, while being insensitive to class-conditional information. As the MoE layers go deeper, expert selection gradually shifts from specific spatial positions to a more dispersed, balanced pattern. Expert specialization is more concentrated at early time steps and becomes gradually uniform after the halfway point. The authors attribute this to the diffusion process, which first models low-frequency spatial structure and later high-frequency detail. Based on these designs, DiT-MoE experimentally achieves performance on par with dense networks while requiring much less computation during inference. More encouragingly, the authors demonstrate the potential of DiT-MoE with synthesized image data, scaling diffusion models to 16.5B parameters and attaining a new state-of-the-art FID-50K score of 1.80 at 512×512 resolution.

The paper also explores the application of sparsity to diffusion models within the Diffusion Transformer (DiT) framework. DiT has demonstrated superior scalability across various parameter settings, achieving better generative performance than CNN-based U-Net architectures with higher training compute efficiency. DiT processes images as a sequence of patches, and its standard MLP blocks consist of two linear layers with a GeLU non-linearity. In DiT-MoE, a subset of these MLPs is replaced with MoE layers, where each expert is itself an MLP; the experts share the same architecture and follow a design pattern similar to previous MoE works (see the sketches below).

The paper provides a comprehensive analysis of the expert routing mechanism, demonstrating that these designs make it possible to train a parameter-efficient MoE diffusion model, and it reports several interesting phenomena about expert routing from different perspectives. The authors validate the benefits of the DiT-MoE architecture, present an effective recipe for large-scale training of DiT-MoE, and evaluate class-conditional image generation on the ImageNet benchmark. Experimental results indicate that DiT-MoE matches the performance of state-of-the-art dense models while requiring less inference time; alternatively, DiT-MoE-S can match the cost of DiT-B while achieving better performance. Leveraging additional synthesized data, the authors then scale the model to 16.5B parameters while activating only 3.1B, attaining a new state-of-the-art FID-50K score of 1.80 at 512×512 resolution. The paper concludes that DiT-MoE is a promising approach for scaling diffusion models, achieving efficient inference together with generation quality competitive with dense networks.
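To make the layer design concrete, below is a minimal PyTorch sketch of a DiT-MoE-style sparse feed-forward block: a shared expert that processes every token, plus a pool of routed experts selected per token by a top-k router. The class and argument names (MoEFeedForward, num_experts, top_k) and details such as renormalizing the top-k weights are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a DiT-MoE-style sparse feed-forward block (illustrative
# names and hyperparameters; not the authors' exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """Two-layer MLP with a GeLU non-linearity, as in the standard DiT block."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class MoEFeedForward(nn.Module):
    """One always-active shared expert plus top-k routed experts per token."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.shared_expert = ExpertMLP(dim, hidden_dim)
        self.experts = nn.ModuleList([ExpertMLP(dim, hidden_dim) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor):
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                                 # (b*t, d)

        # Router distribution over experts for every token.
        router_probs = F.softmax(self.router(tokens), dim=-1)     # (b*t, E)
        topk_probs, topk_idx = router_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Shared expert captures knowledge common to all tokens.
        shared_out = self.shared_expert(tokens)

        # Each token is processed only by its top-k routed experts.
        routed_out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            sel = topk_idx == e                                    # (b*t, top_k)
            token_mask = sel.any(dim=-1)
            if token_mask.any():
                weight = (topk_probs * sel).sum(dim=-1, keepdim=True)[token_mask]
                routed_out[token_mask] = routed_out[token_mask] + weight * expert(tokens[token_mask])

        out = shared_out + routed_out
        return out.reshape(b, t, d), router_probs, topk_idx
```

Replacing a subset of the DiT MLP blocks with layers of this kind increases the total parameter count while keeping the per-token compute close to that of a dense MLP, which is the source of the inference savings described above.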
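The expert-level balance loss mentioned above can be sketched with a standard auxiliary load-balancing term of the kind popularized by Switch Transformer / GShard-style MoE models; the exact formulation used in DiT-MoE may differ, so treat this as an illustrative assumption.

```python
# Sketch of an expert-level load-balancing auxiliary loss in the style of
# Switch Transformer / GShard; DiT-MoE's exact formulation may differ.
import torch


def expert_balance_loss(router_probs: torch.Tensor,
                        topk_idx: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Penalize uneven token-to-expert assignment.

    router_probs: (num_tokens, num_experts) softmax output of the router.
    topk_idx:     (num_tokens, top_k) indices of the selected experts.
    """
    # f_i: fraction of tokens dispatched to expert i (a non-differentiable count).
    dispatch = torch.zeros_like(router_probs).scatter_(1, topk_idx, 1.0)
    tokens_per_expert = dispatch.mean(dim=0)

    # P_i: mean router probability assigned to expert i (differentiable).
    prob_per_expert = router_probs.mean(dim=0)

    # Minimized when both f and P are uniform over the experts.
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

In training, a term like this is typically added to the diffusion objective with a small coefficient, keeping token-to-expert assignments roughly uniform; the paper credits its balance loss, together with the shared expert, with reducing redundancy among the routed experts.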