4 Jul 2024 | Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, Lei Zhu
**UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks**
**Project Page:** <https://jingjingrenabc.github.io/ultrapixel>
**Abstract:**
UltraPixel is a novel architecture that utilizes cascade diffusion models to generate high-quality images at multiple resolutions (e.g., 1K to 6K) within a single model, while maintaining computational efficiency. It leverages semantics-rich representations of lower-resolution images in the later denoising stages to guide the generation of highly detailed high-resolution images, significantly reducing complexity. UltraPixel introduces implicit neural representations for continuous upsampling and scale-aware normalization layers adaptable to various resolutions. Both low- and high-resolution processes are performed in a compact space, sharing most parameters, with fewer than 3% additional parameters dedicated to high-resolution outputs, which enhances training and inference efficiency. The model achieves fast training with reduced data requirements, producing photo-realistic high-resolution images and demonstrating state-of-the-art performance in extensive experiments.
**Introduction:**
The demand for high-resolution images has surged due to advanced display technologies and professional applications. Existing text-to-image (T2I) models struggle to scale to higher resolutions, often producing artifacts and requiring manual adjustments. UltraPixel addresses these challenges by incorporating semantics-rich representations of low-resolution images to guide high-resolution generation, enhancing visual quality and consistency. The method operates in a compact space, sharing parameters between low- and high-resolution processes, ensuring high efficiency.
**Method:**
UltraPixel uses a cascade architecture to generate images at various resolutions. It extracts guidance from low-resolution (LR) image synthesis and upscales it with an implicit neural representation (INR) to steer high-resolution (HR) generation, while scale-aware normalization layers adapt the shared backbone to different target resolutions. The training objective minimizes the difference between generated and target images.
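The key property of the INR upsampler is that one set of weights serves arbitrary target sizes: the HR grid is expressed as continuous coordinates, LR features are sampled at those coordinates, and a small MLP decodes (feature, local offset) pairs into upsampled features. The sketch below illustrates this idea only; the function and weight names (`inr_upsample`, `mlp_w1`, etc.) and the nearest-cell sampling are simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

def inr_upsample(feat, out_h, out_w, mlp_w1, mlp_b1, mlp_w2, mlp_b2):
    """Illustrative INR-style continuous upsampling (not the paper's code).

    feat: LR feature map of shape (C, h, w). The target grid may be ANY
    (out_h, out_w), which is what makes the upsampling "continuous".
    mlp_*: weights of a small MLP mapping (sampled feature, coordinate
    offset) to an upsampled feature -- hypothetical names.
    """
    C, h, w = feat.shape
    # Continuous target coordinates in [0, 1), one per HR cell center.
    ys = (np.arange(out_h) + 0.5) / out_h
    xs = (np.arange(out_w) + 0.5) / out_w
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    # Nearest LR cell and the fractional offset inside it.
    iy = np.clip((gy * h).astype(int), 0, h - 1)
    ix = np.clip((gx * w).astype(int), 0, w - 1)
    dy = gy * h - (iy + 0.5)
    dx = gx * w - (ix + 0.5)
    sampled = feat[:, iy, ix]                      # (C, out_h, out_w)
    coords = np.stack([dy, dx])                    # (2, out_h, out_w)
    z = np.concatenate([sampled, coords], axis=0)  # (C+2, out_h, out_w)
    z = z.reshape(C + 2, -1).T                     # (N, C+2) point batch
    hidden = np.maximum(z @ mlp_w1 + mlp_b1, 0.0)  # ReLU hidden layer
    out = hidden @ mlp_w2 + mlp_b2                 # (N, C)
    return out.T.reshape(C, out_h, out_w)

# Toy usage: the same weights decode two different target resolutions.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((10, 32)) * 0.1   # input dim = C + 2 coords
b1 = np.zeros(32)
w2 = rng.standard_normal((32, 8)) * 0.1
b2 = np.zeros(8)
up_2x = inr_upsample(feat, 8, 8, w1, b1, w2, b2)
up_3x = inr_upsample(feat, 12, 12, w1, b1, w2, b2)
```

Because the decoder consumes continuous offsets rather than a fixed stride, the same module supports the paper's range of output scales (e.g., 1K to 6K) without resolution-specific layers.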
**Experiments:**
UltraPixel is trained on 1M images of varying resolutions and aspect ratios. It outperforms state-of-the-art methods in terms of PickScore, FID, IS, and CLIP scores, demonstrating superior visual quality and efficiency. Ablation studies show the effectiveness of LR guidance, INR, and scale-aware normalization.
**Conclusion:**
UltraPixel is an efficient framework for generating high-quality images at varying resolutions. It leverages a compact latent space and semantic guidance to simplify semantic planning and detail synthesis, achieving state-of-the-art performance and efficiency in ultra-high-resolution image generation.