This paper introduces Diffusion Transformers (DiTs), a new class of diffusion models based on the transformer architecture. DiTs replace the commonly-used U-Net backbone with a transformer that operates on latent patches. The authors analyze the scalability of DiTs through the lens of forward pass complexity as measured by Gflops. They find that DiTs with higher Gflops consistently have lower FID. In addition to possessing good scalability properties, their largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
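The "latent patches" step can be made concrete with a minimal sketch. The shapes below follow the paper's setup (a 256×256 image encoded by the VAE into a 4-channel 32×32 latent, then split into p×p patches that become transformer tokens); the function name `patchify` and its exact layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def patchify(latent: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) latent into a sequence of flattened p x p patches.

    Returns an array of shape (num_tokens, patch_size * patch_size * C),
    where num_tokens = (H // patch_size) * (W // patch_size).
    """
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "spatial dims must be divisible by p"
    # (C, H/p, p, W/p, p) -> (H/p, W/p, p, p, C) -> (T, p*p*C)
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape((h // p) * (w // p), p * p * c)

# A 256x256 image gives a 4-channel 32x32 latent; with patch size 2
# this yields 16 * 16 = 256 tokens, each of dimension 2 * 2 * 4 = 16.
tokens = patchify(np.zeros((4, 32, 32)), patch_size=2)
print(tokens.shape)  # (256, 16)
```

In the actual model each flattened patch is then linearly projected to the transformer's hidden dimension, as in ViT.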
The authors explore the scaling behavior of transformers with respect to network complexity and sample quality. They show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) framework, they can successfully replace the U-Net backbone with a transformer. They further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between network complexity (measured by Gflops) and sample quality (measured by FID). By simply scaling up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), they are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256×256 ImageNet generation benchmark.
The authors also explore the impact of different transformer block designs on model performance. They find that the adaLN-Zero block yields lower FID than both cross-attention and in-context conditioning while being the most compute-efficient. They also find that increasing the Gflops in the model—either by increasing transformer depth/width or increasing the number of input tokens—yields significant improvements in visual fidelity.
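The adaLN-Zero mechanism can be sketched in a few lines: layer norm without a learned affine, with the scale, shift, and a residual gate regressed from the conditioning vector by an MLP whose final layer is zero-initialized, so every block starts as the identity. This is a minimal numpy illustration of that idea; the function names and shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # LayerNorm without learned affine; the modulation supplies scale/shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_sublayer(x, cond_mlp_out, sublayer):
    """One residual sub-layer with adaLN-Zero modulation.

    cond_mlp_out: conditioning-MLP output, split into shift/scale/gate.
    The MLP's final layer is zero-initialized, so shift = scale = gate = 0
    at init and the whole sub-layer reduces to the identity.
    """
    shift, scale, gate = np.split(cond_mlp_out, 3, axis=-1)
    h = layer_norm(x) * (1 + scale) + shift   # adaptive scale and shift
    return x + gate * sublayer(h)             # zero-initialized residual gate

# At initialization the conditioning MLP outputs zeros -> identity block.
x = np.random.randn(256, 16)
zeros = np.zeros(3 * 16)
out = adaln_zero_sublayer(x, zeros, lambda h: h @ np.random.randn(16, 16))
print(np.allclose(out, x))  # True
```

The zero-initialized gate is what distinguishes adaLN-Zero from plain adaLN: each attention and MLP sub-layer contributes nothing until training pushes the gate away from zero.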
The authors train DiT models on the ImageNet dataset at 256×256 and 512×512 image resolution. They find that increasing model size and decreasing patch size yields considerably improved diffusion models. They also find that DiT models are more compute-efficient than prior U-Net models, and that scaling up sampling compute does not compensate for a lack of model compute.
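Why decreasing patch size helps follows from simple arithmetic: for a square latent of side I, the token count is T = (I/p)², so halving p quadruples the sequence length and roughly quadruples transformer Gflops while leaving the parameter count nearly unchanged. A small sketch of that relationship (the 32×32 latent side is taken from the paper's 256×256 setup):

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    # T = (I / p)^2 for a square latent of side I and patch size p.
    return (latent_size // patch_size) ** 2

# Halving the patch size quadruples the token count (and roughly
# quadruples transformer compute) for the 32x32 latent of a 256x256 image:
for p in (8, 4, 2):
    print(p, num_tokens(32, p))
# 8 -> 16 tokens, 4 -> 64 tokens, 2 -> 256 tokens
```

This is why the paper's naming convention (e.g. DiT-XL/2) reports patch size alongside model size: both axes control Gflops, and it is Gflops, not parameters, that tracks FID.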
The authors conclude that DiTs are a simple transformer-based backbone for diffusion models that outperform prior U-Net models and inherit the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.