This paper introduces Diffusion Transformers (DiTs), a new class of diffusion models based on the transformer architecture. DiTs replace the commonly-used U-Net backbone with a transformer that operates on latent patches. The authors analyze the scalability of DiTs through the lens of forward pass complexity as measured by Gflops. They find that DiTs with higher Gflops consistently have lower FID. In addition to possessing good scalability properties, their largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
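The "latent patches" step can be made concrete with a minimal sketch. The shapes below follow the paper's setup (a 256×256 image encoded by the VAE into a 4-channel 32×32 latent, then split into p×p patches that become transformer tokens); the function name `patchify` and its exact layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def patchify(latent: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) latent into a sequence of flattened p x p patches.

    Returns an array of shape (num_tokens, patch_size * patch_size * C),
    where num_tokens = (H // patch_size) * (W // patch_size).
    """
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "spatial dims must be divisible by p"
    # (C, H/p, p, W/p, p) -> (H/p, W/p, p, p, C) -> (T, p*p*C)
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape((h // p) * (w // p), p * p * c)

# A 256x256 image gives a 4-channel 32x32 latent; with patch size 2
# this yields 16 * 16 = 256 tokens, each of dimension 2 * 2 * 4 = 16.
tokens = patchify(np.zeros((4, 32, 32)), patch_size=2)
print(tokens.shape)  # (256, 16)
```

In the actual model each flattened patch is then linearly projected to the transformer's hidden dimension, as in ViT.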
The authors explore the scaling behavior of transformers with respect to network complexity and sample quality. They show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) framework, they can successfully replace the U-Net backbone with a transformer. They further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between network complexity (measured by Gflops) and sample quality (measured by FID). By simply scaling up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), they are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256×256 ImageNet generation benchmark.
The authors also explore the impact of different transformer block designs on model performance. They find that the adaLN-Zero block yields lower FID than both cross-attention and in-context conditioning while being the most compute-efficient. They also find that increasing the Gflops in the model—either by increasing transformer depth/width or increasing the number of input tokens—yields significant improvements in visual fidelity.
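The adaLN-Zero mechanism can be sketched in a few lines: layer norm without a learned affine, with the scale, shift, and a residual gate regressed from the conditioning vector by an MLP whose final layer is zero-initialized, so every block starts as the identity. This is a minimal numpy illustration of that idea; the function names and shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # LayerNorm without learned affine; the modulation supplies scale/shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_sublayer(x, cond_mlp_out, sublayer):
    """One residual sub-layer with adaLN-Zero modulation.

    cond_mlp_out: conditioning-MLP output, split into shift/scale/gate.
    The MLP's final layer is zero-initialized, so shift = scale = gate = 0
    at init and the whole sub-layer reduces to the identity.
    """
    shift, scale, gate = np.split(cond_mlp_out, 3, axis=-1)
    h = layer_norm(x) * (1 + scale) + shift   # adaptive scale and shift
    return x + gate * sublayer(h)             # zero-initialized residual gate

# At initialization the conditioning MLP outputs zeros -> identity block.
x = np.random.randn(256, 16)
zeros = np.zeros(3 * 16)
out = adaln_zero_sublayer(x, zeros, lambda h: h @ np.random.randn(16, 16))
print(np.allclose(out, x))  # True
```

The zero-initialized gate is what distinguishes adaLN-Zero from plain adaLN: each attention and MLP sub-layer contributes nothing until training pushes the gate away from zero.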
The authors train DiT models on the ImageNet dataset at 256×256 and 512×512 image resolution. They find that increasing model size and decreasing patch size yields considerably improved diffusion models. They also find that DiT models are more compute-efficient than prior U-Net models, and that scaling up sampling compute does not compensate for a lack of model compute.
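Why decreasing patch size helps follows from simple arithmetic: for a square latent of side I, the token count is T = (I/p)², so halving p quadruples the sequence length and roughly quadruples transformer Gflops while leaving the parameter count nearly unchanged. A small sketch of that relationship (the 32×32 latent side is taken from the paper's 256×256 setup):

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    # T = (I / p)^2 for a square latent of side I and patch size p.
    return (latent_size // patch_size) ** 2

# Halving the patch size quadruples the token count (and roughly
# quadruples transformer compute) for the 32x32 latent of a 256x256 image:
for p in (8, 4, 2):
    print(p, num_tokens(32, p))
# 8 -> 16 tokens, 4 -> 64 tokens, 2 -> 256 tokens
```

This is why the paper's naming convention (e.g. DiT-XL/2) reports patch size alongside model size: both axes control Gflops, and it is Gflops, not parameters, that tracks FID.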
The authors conclude that DiTs are a simple transformer-based backbone for diffusion models that outperform prior U-Net models and inherit the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.