21 Jan 2024 | Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
The paper introduces the Hourglass Diffusion Transformer (HDiT), an image generative model whose computational cost scales linearly with pixel count, enabling high-resolution (e.g., 1024×1024) image synthesis directly in pixel space. Building on the Transformer architecture, HDiT bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. The model trains without typical high-resolution techniques such as multiscale architectures or latent autoencoders, achieving competitive performance on ImageNet at 256×256 and setting a new state of the art for diffusion models on FFHQ-1024.
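For a concrete picture of the hourglass structure, here is a minimal PyTorch sketch, not the authors' implementation: it assumes 2×2 pixel-(un)shuffle-style resampling between levels, plain global-attention transformer blocks at every level, and a learnable linear interpolation for merging skip connections. The real HDiT obtains its linear scaling by using local neighborhood attention at the high-resolution levels, which is omitted here for brevity; names like `TransformerLevel` and `Bottleneck` are illustrative, not from the paper.

```python
# Minimal hourglass-transformer sketch (illustrative only; see caveats above).
import torch
import torch.nn as nn


class TransformerLevel(nn.Module):
    """A stack of standard pre-norm transformer blocks over a token sequence."""
    def __init__(self, dim: int, depth: int, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):  # x: (batch, tokens, dim)
        return self.blocks(x)


class Bottleneck(nn.Module):
    """Lowest-resolution level: global attention over a small token grid."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.tf = TransformerLevel(dim, depth)

    def forward(self, x, h, w):
        return self.tf(x)


class Hourglass(nn.Module):
    """One hourglass level: process tokens, halve the spatial resolution,
    run the inner level, restore resolution, and merge the skip connection
    with a learnable interpolation."""
    def __init__(self, dim: int, inner: nn.Module, depth: int = 1):
        super().__init__()
        self.down_tf = TransformerLevel(dim, depth)
        self.up_tf = TransformerLevel(dim, depth)
        self.down_proj = nn.Linear(4 * dim, dim)  # 2x2 patch merge
        self.up_proj = nn.Linear(dim, 4 * dim)    # 2x2 patch split
        self.inner = inner
        self.skip_weight = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, h, w):  # x: (batch, h*w, dim), row-major tokens
        x = self.down_tf(x)
        skip = x
        b, _, d = x.shape
        # Group each 2x2 neighborhood into one lower-resolution token.
        g = x.view(b, h // 2, 2, w // 2, 2, d).permute(0, 1, 3, 2, 4, 5)
        x = self.down_proj(g.reshape(b, (h // 2) * (w // 2), 4 * d))
        x = self.inner(x, h // 2, w // 2)
        # Split each low-resolution token back into its 2x2 neighborhood.
        g = self.up_proj(x).view(b, h // 2, w // 2, 2, 2, d)
        x = g.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, d)
        x = torch.lerp(skip, x, self.skip_weight)  # learnable skip merge
        return self.up_tf(x)


# Two nested levels: a 32x32 grid is processed at 32x32, 16x16, and 8x8.
model = Hourglass(64, Hourglass(64, Bottleneck(64, depth=2)))
tokens = torch.randn(1, 32 * 32, 64)
out = model(tokens, 32, 32)
print(out.shape)  # torch.Size([1, 1024, 64])
```

The nesting mirrors the U-Net-like layout the paper describes: resolution halves on the way down, doubles on the way back up, and same-resolution features are merged before the final upsampling blocks run.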
Key contributions include:
- Adaptation of transformer-based diffusion backbones for efficient, high-quality pixel-space image generation.
- Introduction of the HDiT architecture for high-resolution pixel-space image generation with subquadratic scaling of computational complexity (see the cost sketch after this list).
- Demonstration of HDiT's ability to achieve high-quality direct pixel-space generation at resolutions of 1024×1024 without requiring high-resolution-specific training tricks.
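To see where the subquadratic (roughly linear) scaling comes from, the following is a back-of-the-envelope cost model, my own illustration rather than the paper's FLOP accounting. The premise matches the paper's design: high-resolution levels use local neighborhood attention with a fixed k×k window (cost proportional to tokens × k²), while only the fixed-size low-resolution bottleneck uses global attention (cost proportional to tokens²). The window size k = 7 and bottleneck side of 16 below are illustrative choices, not values from the paper.

```python
# Back-of-the-envelope attention-cost model (illustrative assumptions: fixed
# 7x7 neighborhood attention at high-res levels, global attention only at a
# fixed 16x16 bottleneck, 2x downsampling between hourglass levels).
def attention_cost(side: int, k: int = 7, bottleneck_side: int = 16) -> int:
    cost = 0
    while side > bottleneck_side:
        cost += (side * side) * (k * k)  # local attention: linear in tokens
        side //= 2                       # hourglass downsampling
    cost += (side * side) ** 2           # global attention at fixed low res
    return cost


base = attention_cost(256)
for s in (256, 512, 1024):
    pixel_ratio = (s * s) / (256 * 256)
    print(f"{s}x{s}: cost ratio {attention_cost(s) / base:.2f} "
          f"(pixel ratio {pixel_ratio:.0f})")
# Cost grows ~4x when the resolution doubles (pixel count quadruples),
# i.e. roughly linearly in pixel count, vs. 16x for pure global attention.
```

With a fixed attention window, each level's cost is proportional to its token count, and the geometric series over levels keeps the total linear in pixel count; the global-attention bottleneck contributes only a constant.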
The paper also discusses related work, including the challenges of high-resolution image synthesis with diffusion models and improvements in transformer architectures. Experimental results show that HDiT outperforms existing models on FID and IS, both in megapixel-scale pixel-space image synthesis and in large-scale ImageNet image synthesis.