3 Jun 2024 | Yuchuan Tian*, Zhijun Tu*, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang†
The paper "U-DiT: Downsample Tokens in U-Shaped Diffusion Transformers" explores the integration of U-Net architecture into diffusion transformers (DiTs) for latent-space image generation. While DiTs have demonstrated strong performance and scalability, the authors question the abandonment of the widely used U-Net architecture. Through a toy experiment, they find that a U-Net-style DiT (DtT-UNet) performs only slightly better than an isotropic DiT, suggesting potential redundancies in the U-Net architecture. Inspired by the observation that U-Net features are dominated by low-frequency components, the authors propose token downsampling for self-attention, which significantly reduces computation while improving performance. This leads to the development of U-shaped Diffusion Transformers (U-DiTs), which outperform DiTs with a much lower computational cost. Extensive experiments show that U-DiTs achieve superior performance and scalability, outperforming DiT-XL2 with only 1/6 of its computation cost. The paper also discusses the design and evaluation of various modifications to U-DiTs, including downsampling techniques, attention mechanisms, and other architectural improvements.The paper "U-DiT: Downsample Tokens in U-Shaped Diffusion Transformers" explores the integration of U-Net architecture into diffusion transformers (DiTs) for latent-space image generation. While DiTs have demonstrated strong performance and scalability, the authors question the abandonment of the widely used U-Net architecture. Through a toy experiment, they find that a U-Net-style DiT (DtT-UNet) performs only slightly better than an isotropic DiT, suggesting potential redundancies in the U-Net architecture. Inspired by the observation that U-Net features are dominated by low-frequency components, the authors propose token downsampling for self-attention, which significantly reduces computation while improving performance. This leads to the development of U-shaped Diffusion Transformers (U-DiTs), which outperform DiTs with a much lower computational cost. Extensive experiments show that U-DiTs achieve superior performance and scalability, outperforming DiT-XL2 with only 1/6 of its computation cost. The paper also discusses the design and evaluation of various modifications to U-DiTs, including downsampling techniques, attention mechanisms, and other architectural improvements.