21 Jan 2024 | Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
This paper introduces the Hourglass Diffusion Transformer (HDiT), a new image generative model whose compute scales linearly with pixel count, enabling high-resolution image generation directly in pixel space. HDiT builds on the Transformer architecture, known for its scalability to billions of parameters, and bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. Unlike approaches that depend on high-resolution training techniques such as multiscale architectures or latent autoencoders, HDiT trains successfully without them, performing competitively with existing models on ImageNet at 256×256 and setting a new state of the art for diffusion models on FFHQ at 1024×1024.
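To make the linear-versus-quadratic distinction concrete, the back-of-the-envelope comparison below contrasts global self-attention, whose cost grows quadratically in the number of pixel tokens, with the kind of local windowed (neighborhood) attention HDiT applies at its high-resolution levels, whose cost grows linearly. The patch size and window size here are assumptions chosen purely for illustration, not values taken from the paper.

```python
# Illustrative back-of-the-envelope comparison (not from the paper): cost of
# global self-attention, quadratic in the number of tokens n, versus
# neighborhood attention with a fixed k x k window, linear in n.
# Patch size 4 and window size 7 are assumptions for illustration only.

def attention_costs(resolution: int, patch: int = 4, window: int = 7):
    n = (resolution // patch) ** 2       # number of tokens after patching
    global_cost = n * n                  # O(n^2) pairwise interactions
    local_cost = n * window * window     # O(n * k^2) windowed interactions
    return n, global_cost, local_cost

for res in (256, 512, 1024):
    n, g, l = attention_costs(res)
    print(f"{res}x{res}: tokens={n:>7,}  global~{g:.2e}  local~{l:.2e}  ratio={g / l:,.0f}x")
```

With these assumed sizes the gap is already roughly 80× at 256×256 and grows past 1000× at 1024×1024, which illustrates why a purely global-attention backbone becomes impractical for pixel-space generation at megapixel resolutions.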
HDiT is designed to scale efficiently with resolution: its computational cost grows as O(n) in the number of pixels n, rather than the O(n²) of standard diffusion transformer backbones such as DiT, making it far more efficient at high resolutions. The model generates high-quality images at resolutions up to 1024×1024 without high-resolution-specific training tricks. Its architecture is inspired by the hierarchical structure of images and incorporates a range of architectural improvements, including a skip-merging mechanism that enables efficient information transfer between the different levels of the model.
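The sketch below illustrates the hourglass idea under stated assumptions: tokens pass through an outer transformer level, are downsampled 2×2 into a smaller inner level, are upsampled back, and are combined with the skip branch through a learnable interpolation ("lerp") merge. It is a minimal PyTorch illustration, not the paper's implementation: the class names are invented, every level here uses plain global attention, and the skip merge uses a single per-channel weight, whereas HDiT uses neighborhood attention at its high-resolution levels and a more elaborate block design.

```python
# Minimal hourglass-transformer sketch with learnable skip merging (assumes PyTorch).
import torch
import torch.nn as nn


class TransformerLevel(nn.Module):
    """A small stack of pre-norm transformer blocks on (B, N, C) token sequences."""
    def __init__(self, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        return self.blocks(x)


class LerpSkipMerge(nn.Module):
    """Merges upsampled features with the skip branch via a learned per-channel interpolation."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), 0.5))  # merge weight logits

    def forward(self, upsampled, skip):
        return torch.lerp(skip, upsampled, self.alpha.sigmoid())


class HourglassSketch(nn.Module):
    """Two-level hourglass: tokens are folded 2x2 before the inner level,
    unfolded afterwards, and joined to the skip path by a learnable merge."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.outer_in = TransformerLevel(dim)
        self.down = nn.Linear(4 * dim, dim)   # merge each 2x2 token group into one token
        self.inner = TransformerLevel(dim)
        self.up = nn.Linear(dim, 4 * dim)     # split each token back into a 2x2 group
        self.merge = LerpSkipMerge(dim)
        self.outer_out = TransformerLevel(dim)

    def forward(self, x, hw):
        h, w = hw                              # token grid size, both even
        skip = self.outer_in(x)                # (B, h*w, C)

        # Downsample: fold each 2x2 neighbourhood of tokens into one token.
        b, _, c = skip.shape
        grid = skip.view(b, h // 2, 2, w // 2, 2, c)
        grid = grid.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        inner = self.inner(self.down(grid))    # attention runs on 4x fewer tokens

        # Upsample: expand each coarse token back into its 2x2 neighbourhood.
        up = self.up(inner).view(b, h // 2, w // 2, 2, 2, c)
        up = up.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, c)

        return self.outer_out(self.merge(up, skip))


tokens = torch.randn(1, 16 * 16, 64)           # e.g. a 16x16 token grid
out = HourglassSketch()(tokens, (16, 16))
print(out.shape)                               # torch.Size([1, 256, 64])
```

The structural point is that the expensive attention runs on the downsampled token grid, while the high-resolution levels need only cheap local mixing, and the learned skip merge lets the model decide how much high-frequency detail to reinject on the way back up.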
The paper also presents an extensive experimental evaluation of HDiT, demonstrating its effectiveness in both high-resolution pixel-space image synthesis and large-scale ImageNet image generation. The results show that HDiT outperforms existing models in terms of image quality and computational efficiency, particularly in high-resolution settings. The model is also shown to be competitive with other transformer-based diffusion backbones, even at smaller resolutions, and is more efficient than U-Nets in pixel-space settings.
The paper concludes that HDiT provides a promising foundation for further research into efficient high-resolution image synthesis. While the current work focuses on unconditional and class-conditional image synthesis, the architecture is likely well-suited for other generative tasks such as super-resolution, text-to-image generation, and synthesis of other modalities like audio and video. Future work could explore applying HDiT in latent diffusion setups to achieve even higher efficiency and multi-megapixel image resolutions.