U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers


3 Jun 2024 | Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang
Diffusion Transformers (DiTs) apply the transformer architecture to diffusion tasks for latent-space image generation. DiTs have shown competitive performance and good scalability, but they abandon the U-Net architecture that was widely used in earlier diffusion models. This paper re-examines the use of U-Net in DiTs. A simple toy experiment shows that a U-Net-style DiT gains only a slight advantage over an isotropic DiT, indicating potential redundancies in the U-Net design. Inspired by the observation that U-Net backbone features are dominated by low-frequency components, the paper downsamples the tokens forming the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, the paper proposes a series of U-shaped Diffusion Transformers (U-DiTs) and conducts extensive experiments to demonstrate their performance. The proposed U-DiT outperforms DiT-XL/2 with only 1/6 of its computation cost.

The paper first investigates the performance of U-Net DiTs in latent space. A canonical U-Net-style DiT, a naive Transformer-backboned U-Net denoiser, is compared with an isotropic DiT of similar size. The results show that the U-Net-style DiT gains only a limited advantage over the original isotropic DiT, suggesting that the inductive bias of the U-Net is insufficiently exploited. Guided by prior findings on diffusion backbones, the paper then downsamples the visual tokens for self-attention, which further improves performance despite a large cut in FLOPs. Building on this discovery, the paper scales the architecture up into a series of U-shaped DiTs and conducts various experiments to demonstrate their performance and scalability. The results show that U-DiTs outperform DiTs by large margins; the proposed U-DiT model performs better than DiT-XL/2, which requires roughly six times more FLOPs.

The paper also discusses the limitations of the current work, such as the constrained computation resources and training schedule, and its broader impacts, including the potential for misuse arising from biases in the training data. As a safeguard, the authors note the need for an algorithm that checks generated images to identify and mitigate content that contravenes legal or ethical usage.
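The core mechanism, self-attention over downsampled tokens, can be sketched compactly. The snippet below is a minimal PyTorch illustration, not the paper's implementation: the choice of 2x2 average pooling as the downsampler, bilinear upsampling to restore resolution, and the block layout are assumptions made for clarity, while the paper's actual token downsampler and merging scheme may differ.

```python
# Minimal sketch of self-attention over downsampled tokens (PyTorch).
# The downsampling/upsampling operators here are illustrative assumptions,
# not the exact design used in the U-DiT paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DownsampledSelfAttention(nn.Module):
    """Self-attention computed on a 2x-downsampled token grid.

    Tokens are assumed to lie on an (h, w) latent grid. The grid is
    average-pooled 2x2, attention runs on the reduced token set
    (4x fewer tokens), and the output is upsampled back to full resolution.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, h*w, C) token sequence laid out on an h x w grid
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)

        # Downsample the token grid (illustrative choice: 2x2 average pooling).
        down = F.avg_pool2d(grid, kernel_size=2)           # (B, C, h/2, w/2)
        tokens = down.flatten(2).transpose(1, 2)           # (B, h*w/4, C)

        # Standard multi-head self-attention on the reduced token set.
        tokens = self.norm(tokens)
        attn_out, _ = self.attn(tokens, tokens, tokens)

        # Restore the original spatial resolution and token layout.
        attn_grid = attn_out.transpose(1, 2).reshape(b, c, h // 2, w // 2)
        up = F.interpolate(attn_grid, size=(h, w), mode="bilinear",
                           align_corners=False)
        return up.flatten(2).transpose(1, 2)               # (B, h*w, C)


if __name__ == "__main__":
    # Toy usage on a 16x16 latent grid with 384-dim tokens.
    x = torch.randn(2, 16 * 16, 384)
    block = DownsampledSelfAttention(dim=384, num_heads=6)
    y = block(x, h=16, w=16)
    print(y.shape)  # torch.Size([2, 256, 384])
```

Because self-attention cost scales quadratically with token count, attending over a 4x-smaller token set cuts attention FLOPs by roughly 16x, which matches the paper's motivation of discarding redundant high-frequency token detail in exchange for compute.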