4 Jul 2024 | Jingjing Ren¹, Wenbo Li²*, Haoyu Chen¹, Renjing Pei², Bin Shao², Yong Guo³, Long Peng², Fenglong Song², Lei Zhu¹,⁴†
UltraPixel is a novel architecture for generating ultra-high-resolution images with high quality and efficiency. It uses cascade diffusion models to produce images at multiple resolutions (e.g., 1K to 6K) within a single model while remaining computationally efficient. The model leverages the semantics-rich representation of a lower-resolution image, taken from a later denoising stage, to guide the generation of highly detailed high-resolution images, significantly reducing complexity. It further introduces implicit neural representations for continuous upsampling and scale-aware normalization layers that adapt to various resolutions. Notably, both the low- and high-resolution processes operate in the most compact space and share the majority of parameters, with less than 3% additional parameters dedicated to high-resolution outputs, greatly improving training and inference efficiency. The model trains quickly with reduced data requirements, produces photo-realistic high-resolution images, and demonstrates state-of-the-art performance in extensive experiments.
UltraPixel addresses the challenges of ultra-high-resolution image generation: increased complexity of semantic planning, greater difficulty of detail synthesis, and substantial training resource demands. It builds on a cascaded decoding strategy that combines a diffusion decoder with a variational autoencoder (VAE), achieving a 42:1 compression ratio and thus a highly compact feature representation. As illustrated in the appendix, the cascade decoder can also process features at various resolutions, which motivates generating higher-resolution representations directly within this most compact space, thereby enhancing both training and inference efficiency. However, directly performing semantic planning and detail synthesis at larger scales remains challenging: because of the distribution gap across resolutions (seen as scattered clusters in the t-SNE visualization in Figure 2), existing models struggle to produce visually pleasing, semantically coherent results and often yield overly dark images with unpleasant artifacts.
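As a rough illustration of why this compact space is attractive, the sketch below (an assumption-based estimate, not the authors' code) maps image resolutions to compact latent sizes, assuming the reported 42:1 ratio refers to per-side spatial compression by the cascaded diffusion + VAE decoder.

```python
# A minimal sketch (not the authors' code) of why generating in the most
# compact space is attractive. It assumes the 42:1 ratio is a per-side
# spatial compression of the cascaded diffusion + VAE decoder.

def compact_latent_size(height: int, width: int, compression: float = 42.0) -> tuple[int, int]:
    """Approximate spatial size of the most compact latent for a target image size."""
    return round(height / compression), round(width / compression)

if __name__ == "__main__":
    for h, w in [(1024, 1024), (2048, 2048), (4096, 4096), (4096, 6144)]:
        lh, lw = compact_latent_size(h, w)
        print(f"{h}x{w} image -> ~{lh}x{lw} compact latent")
    # Even a ~6K output corresponds to a small latent grid, so semantic planning
    # and detail synthesis are far cheaper than at pixel scale.
```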
UltraPixel introduces a method for high-quality ultra-high-resolution image generation. By incorporating semantics-rich representations of the low-resolution image from a later denoising stage as guidance, the model grasps the global semantic layout from the beginning, effectively fuses text information, and can focus on detail refinement. The entire process operates in a compact space, with low- and high-resolution generation sharing the majority of parameters and requiring less than 3% additional parameters for the high-resolution branch, ensuring high efficiency. Unlike conventional methods that require separate parameters for different resolutions, our network accommodates varying resolutions and is highly resource-friendly. We achieve this by learning implicit neural representations to upscale low-resolution features, ensuring continuous guidance, and by developing scale-aware, learnable normalization layers that adapt to numerical differences across resolutions. Our model, trained on 1 million high-quality images of diverse sizes, can efficiently produce photo-realistic images at multiple resolutions (e.g., from 1K to 6K with varying aspect ratios) in both training and inference. The image quality of our method is comparable to leading closed-source T2I commercial products such as Midjourney V6 and DALL·E 3. Moreover, we demonstrate the application of ControlNet and personalized generation on top of our architecture.
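To make the two resolution-adaptive components above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a normalization layer whose affine parameters are conditioned on the target resolution, and a LIIF-style continuous upsampler standing in for the learned implicit neural representation. All names, shapes, and hyper-parameters are illustrative assumptions.

```python
# A minimal PyTorch sketch of the two resolution-adaptive ideas described above.
# Names, shapes, and hyper-parameters are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareNorm(nn.Module):
    """LayerNorm whose affine parameters are modulated by the target resolution,
    so shared weights can adapt to the numerical statistics of features at
    different scales (an assumption about how scale-aware normalization works)."""
    def __init__(self, channels: int, embed_dim: int = 128):
        super().__init__()
        self.embed_dim = embed_dim
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.SiLU(),
            nn.Linear(embed_dim, 2 * channels),
        )

    def forward(self, x: torch.Tensor, target_hw: tuple[int, int]) -> torch.Tensor:
        # x: (B, N, C) token features; target_hw: output resolution in pixels.
        h, w = target_hw
        # Encode the target scale as a small sinusoidal embedding (assumption).
        freqs = torch.arange(self.embed_dim // 2, device=x.device, dtype=x.dtype)
        phase = torch.log(torch.tensor(float(h * w), device=x.device)) / (freqs + 1.0)
        emb = torch.cat([torch.sin(phase), torch.cos(phase)]).unsqueeze(0)  # (1, embed_dim)
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

def continuous_upsample(lr_feat: torch.Tensor, out_hw: tuple[int, int]) -> torch.Tensor:
    """Continuous upsampling of low-resolution guidance features: query them at
    arbitrary target coordinates (here via bilinear grid_sample as a stand-in
    for the learned implicit neural representation)."""
    b = lr_feat.shape[0]
    h, w = out_hw
    ys = torch.linspace(-1, 1, h, device=lr_feat.device)
    xs = torch.linspace(-1, 1, w, device=lr_feat.device)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2) as (y, x)
    grid = grid.flip(-1).unsqueeze(0).expand(b, -1, -1, -1)            # grid_sample wants (x, y)
    return F.grid_sample(lr_feat, grid, mode="bilinear", align_corners=True)

if __name__ == "__main__":
    lr = torch.randn(1, 320, 24, 24)             # compact low-resolution guidance features
    hr = continuous_upsample(lr, (96, 96))       # continuous upsampling to the target grid
    tokens = hr.flatten(2).transpose(1, 2)       # (B, N, C)
    norm = ScaleAwareNorm(channels=320)
    out = norm(tokens, target_hw=(4096, 4096))   # adapt feature statistics to a 4K target
    print(out.shape)                             # torch.Size([1, 9216, 320])
```

In this reading, the upsampled low-resolution features provide continuous guidance at whatever grid size the high-resolution branch needs, while the scale-conditioned normalization absorbs the statistical differences between resolutions without duplicating the shared backbone parameters.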