21 Jan 2024 | Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
This paper introduces the Hourglass Diffusion Transformer (HDiT), a new image generative model whose compute scales linearly with pixel count, enabling high-resolution image generation directly in pixel space. HDiT builds on the Transformer architecture, known for its scalability to billions of parameters, and bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. Unlike approaches that depend on high-resolution training techniques such as multiscale architectures or latent autoencoders, HDiT trains successfully without them, performing competitively with existing models on ImageNet at 256×256 and setting a new state of the art for diffusion models on FFHQ at 1024×1024.
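To make the linear-versus-quadratic distinction concrete, the back-of-the-envelope comparison below contrasts global self-attention, whose cost grows quadratically in the number of pixel tokens, with the kind of local windowed (neighborhood) attention HDiT applies at its high-resolution levels, whose cost grows linearly. The patch size and window size here are assumptions chosen purely for illustration, not values taken from the paper.

```python
# Illustrative back-of-the-envelope comparison (not from the paper): cost of
# global self-attention, quadratic in the number of tokens n, versus
# neighborhood attention with a fixed k x k window, linear in n.
# Patch size 4 and window size 7 are assumptions for illustration only.

def attention_costs(resolution: int, patch: int = 4, window: int = 7):
    n = (resolution // patch) ** 2       # number of tokens after patching
    global_cost = n * n                  # O(n^2) pairwise interactions
    local_cost = n * window * window     # O(n * k^2) windowed interactions
    return n, global_cost, local_cost

for res in (256, 512, 1024):
    n, g, l = attention_costs(res)
    print(f"{res}x{res}: tokens={n:>7,}  global~{g:.2e}  local~{l:.2e}  ratio={g / l:,.0f}x")
```

With these assumed sizes the gap is already roughly 80× at 256×256 and grows past 1000× at 1024×1024, which illustrates why a purely global-attention backbone becomes impractical for pixel-space generation at megapixel resolutions.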
HDiT is designed to scale efficiently with resolution: its computational cost grows as O(n) in the number of pixels n, rather than the O(n²) of standard diffusion transformer backbones such as DiT, making it far more efficient at high resolutions. The model generates high-quality images at resolutions up to 1024×1024 without high-resolution-specific training tricks. Its architecture is inspired by the hierarchical structure of images and incorporates a range of architectural improvements, including a skip-merging mechanism that enables efficient information transfer between the different levels of the model.
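The sketch below illustrates the hourglass idea under stated assumptions: tokens pass through an outer transformer level, are downsampled 2×2 into a smaller inner level, are upsampled back, and are combined with the skip branch through a learnable interpolation ("lerp") merge. It is a minimal PyTorch illustration, not the paper's implementation: the class names are invented, every level here uses plain global attention, and the skip merge uses a single per-channel weight, whereas HDiT uses neighborhood attention at its high-resolution levels and a more elaborate block design.

```python
# Minimal hourglass-transformer sketch with learnable skip merging (assumes PyTorch).
import torch
import torch.nn as nn


class TransformerLevel(nn.Module):
    """A small stack of pre-norm transformer blocks on (B, N, C) token sequences."""
    def __init__(self, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        return self.blocks(x)


class LerpSkipMerge(nn.Module):
    """Merges upsampled features with the skip branch via a learned per-channel interpolation."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), 0.5))  # merge weight logits

    def forward(self, upsampled, skip):
        return torch.lerp(skip, upsampled, self.alpha.sigmoid())


class HourglassSketch(nn.Module):
    """Two-level hourglass: tokens are folded 2x2 before the inner level,
    unfolded afterwards, and joined to the skip path by a learnable merge."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.outer_in = TransformerLevel(dim)
        self.down = nn.Linear(4 * dim, dim)   # merge each 2x2 token group into one token
        self.inner = TransformerLevel(dim)
        self.up = nn.Linear(dim, 4 * dim)     # split each token back into a 2x2 group
        self.merge = LerpSkipMerge(dim)
        self.outer_out = TransformerLevel(dim)

    def forward(self, x, hw):
        h, w = hw                              # token grid size, both even
        skip = self.outer_in(x)                # (B, h*w, C)

        # Downsample: fold each 2x2 neighbourhood of tokens into one token.
        b, _, c = skip.shape
        grid = skip.view(b, h // 2, 2, w // 2, 2, c)
        grid = grid.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        inner = self.inner(self.down(grid))    # attention runs on 4x fewer tokens

        # Upsample: expand each coarse token back into its 2x2 neighbourhood.
        up = self.up(inner).view(b, h // 2, w // 2, 2, 2, c)
        up = up.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, c)

        return self.outer_out(self.merge(up, skip))


tokens = torch.randn(1, 16 * 16, 64)           # e.g. a 16x16 token grid
out = HourglassSketch()(tokens, (16, 16))
print(out.shape)                               # torch.Size([1, 256, 64])
```

The structural point is that the expensive attention runs on the downsampled token grid, while the high-resolution levels need only cheap local mixing, and the learned skip merge lets the model decide how much high-frequency detail to reinject on the way back up.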
The paper also presents an extensive experimental evaluation of HDiT, demonstrating its effectiveness in both high-resolution pixel-space image synthesis and large-scale ImageNet image generation. The results show that HDiT outperforms existing models in terms of image quality and computational efficiency, particularly in high-resolution settings. The model is also shown to be competitive with other transformer-based diffusion backbones, even at smaller resolutions, and is more efficient than U-Nets in pixel-space settings.
The paper concludes that HDiT provides a promising foundation for further research into efficient high-resolution image synthesis. While the current work focuses on unconditional and class-conditional image synthesis, the architecture is likely well-suited for other generative tasks such as super-resolution, text-to-image generation, and synthesis of other modalities like audio and video. Future work could explore applying HDiT in latent diffusion setups to achieve even higher efficiency and multi-megapixel image resolutions.