Scaling Diffusion Transformers to 16 Billion Parameters


16 Jul 2024 | Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
This paper introduces DiT-MoE, a sparse version of the diffusion Transformer that scales well and performs competitively with dense networks while optimizing inference efficiency. DiT-MoE incorporates two key designs: shared expert routing and an expert-level balance loss, which help capture common knowledge and reduce redundancy among the routed experts. The authors conduct a deep analysis of expert specialization in conditional image generation, observing that expert selection is influenced by spatial position and denoising time step, but is less sensitive to class-conditional information. As the MoE layers deepen, expert selection shifts from specific spatial positions to a more dispersed and balanced distribution. The model is evaluated on ImageNet benchmarks, achieving state-of-the-art performance with significantly reduced computational costs. Notably, DiT-MoE scales to 16.5 billion parameters, achieving a new FID-50K score of 1.80 at 512×512 resolution. The project page is available at https://github.com/feizc/DiT-MoE.
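To make the two key designs concrete, below is a minimal PyTorch sketch of a sparse MoE feed-forward block with always-active shared experts and a Switch-style expert-level balance loss. The number of experts, the top-k value, and the loss formulation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Sketch of a sparse MoE FFN: top-k routed experts + shared experts,
    returning an auxiliary expert-level balance loss. Hyperparameters here
    (num_experts, top_k, num_shared) are illustrative, not DiT-MoE's settings."""

    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2, num_shared=1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Routed experts: standard two-layer MLPs, only top-k are active per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # Shared experts: applied to every token, intended to capture common knowledge.
        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_shared)
        ])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        # x: (batch, tokens, dim) -> flatten tokens for per-token routing.
        b, t, d = x.shape
        flat = x.reshape(-1, d)
        logits = self.router(flat)                        # (N, E)
        probs = logits.softmax(dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)   # (N, k)

        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = (topk_i == e)                          # (N, k) bool
            if mask.any():
                token_idx, slot_idx = mask.nonzero(as_tuple=True)
                weight = topk_p[token_idx, slot_idx].unsqueeze(-1)
                out[token_idx] += weight * expert(flat[token_idx])
        for expert in self.shared:
            out = out + expert(flat)

        # Expert-level balance loss (Switch-Transformer style): product of the
        # fraction of tokens dispatched to each expert and its mean gate probability.
        counts = torch.zeros(self.num_experts, device=x.device)
        counts.scatter_add_(
            0, topk_i.reshape(-1),
            torch.ones_like(topk_i.reshape(-1), dtype=counts.dtype),
        )
        frac_tokens = counts / (flat.size(0) * self.top_k)
        frac_probs = probs.mean(dim=0)
        balance_loss = self.num_experts * (frac_tokens * frac_probs).sum()

        return out.reshape(b, t, d), balance_loss
```

In a diffusion Transformer, a block like this would replace the dense FFN in (some or all) Transformer layers, and the returned balance loss would be added to the denoising objective with a small weight so that tokens spread more evenly across the routed experts.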