5 Jun 2024 | Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao
Lumina-Next is an improved version of Lumina-T2X, a framework built on Flow-based Large Diffusion Transformers (Flag-DiT) for generating various modalities, such as images and videos, from text instructions. Lumina-Next addresses the limitations of Lumina-T2X, including training instability, slow inference, and extrapolation artifacts, by introducing several key improvements:
1. **Next-DiT Architecture**: Revisits the Flag-DiT architecture and introduces 3D RoPE and sandwich normalizations to enhance resolution extrapolation and control network activation magnitudes.
2. **Frequency- and Time-Aware Scaled RoPE**: Proposes Frequency-Aware Scaled RoPE and Time-Aware Scaled RoPE to improve content diversity and global consistency during resolution extrapolation.
3. **Optimized Time Schedule and Higher-Order Solvers**: Develops a sigmoid time discretization schedule and higher-order ODE solvers to reduce sampling steps and improve sampling quality.
4. **Time-Aware Context Drop**: Introduces a method to merge redundant visual tokens during inference to speed up network evaluation.
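The sandwich normalization of item 1 places a normalization on both sides of each sublayer, so the residual branch a block adds always has bounded magnitude regardless of how large the sublayer's raw activations grow. A minimal NumPy sketch of this idea (the function names and RMSNorm choice are illustrative, not the paper's exact code):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Root-mean-square normalization over the last axis."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def sandwich_block(x, sublayer):
    """Normalize the sublayer's input *and* its output before the
    residual add, so each block's contribution has RMS close to 1
    no matter how large the sublayer's activations become."""
    return x + rms_norm(sublayer(rms_norm(x)))
```

Even with a sublayer that scales its input by a large factor, the post-normalization pins the residual branch's per-token RMS near 1, which is what keeps activation magnitudes under control at high resolutions.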
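Frequency-aware scaling of RoPE (item 2) rescales the rotary base so that positions beyond the training resolution remain in-distribution. The sketch below uses an NTK-style base rescaling as an illustrative stand-in for the paper's exact formulation; `rope_freqs` and `apply_rope` are hypothetical helper names:

```python
import numpy as np

def rope_freqs(dim, base=10000.0, scale=1.0):
    """Rotary frequencies for a head dimension `dim`. Setting
    `scale` > 1 rescales the base (NTK-style), stretching the
    low-frequency components to cover longer position ranges."""
    scaled_base = base * scale ** (dim / (dim - 2))
    return 1.0 / scaled_base ** (np.arange(0, dim, 2) / dim)

def apply_rope(x, positions, freqs):
    """Rotate consecutive feature pairs of x (seq, dim) by
    position-dependent angles; a pure-rotation op that preserves
    each token's norm."""
    angles = np.outer(positions, freqs)          # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because RoPE is a rotation, scaling only changes *how fast* angles advance with position, not token magnitudes, which is why it can be adjusted at inference time for resolution extrapolation.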
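Item 3 combines a non-uniform time grid with a higher-order solver for the flow ODE. The sketch below pairs an illustrative sigmoid-warped schedule with an explicit midpoint (second-order) solver; the constant `scale` and the toy velocity field are assumptions standing in for the paper's tuned schedule and the learned flow network:

```python
import numpy as np

def sigmoid_schedule(n, scale=6.0):
    """Sigmoid-warped discretization of [0, 1] into n steps:
    a uniform grid pushed through a sigmoid concentrates points
    near the endpoints, where the flow changes most rapidly."""
    u = 1.0 / (1.0 + np.exp(-np.linspace(-scale, scale, n + 1)))
    return (u - u[0]) / (u[-1] - u[0])   # monotone grid on [0, 1]

def midpoint_sample(x, v, ts):
    """Explicit midpoint (2nd-order) ODE solver over the grid `ts`:
    evaluate the velocity at a half-step to cancel the first-order
    error of plain Euler integration."""
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0
        x_mid = x + 0.5 * h * v(x, t0)
        x = x + h * v(x_mid, t0 + 0.5 * h)
    return x
```

The second-order local error is what lets a higher-order solver reach comparable quality in far fewer steps than Euler integration.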
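Time-Aware Context Drop (item 4) speeds up network evaluation by merging redundant visual tokens. As a simplified illustration of the underlying token-merging idea, here is a ToMe-style bipartite merge; the alternate split, greedy averaging, and function name are a stand-in sketch, not the paper's method, which additionally conditions the merge ratio on the diffusion timestep:

```python
import numpy as np

def bipartite_merge(x, r):
    """Split tokens (n, d) into two alternating sets, find each
    a-token's most similar partner in b by cosine similarity, and
    merge the r most redundant a-tokens into their partners by
    averaging. Returns n - r tokens."""
    a, b = x[0::2], x[1::2]
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                           # (len(a), len(b)) cosine sims
    best = sim.argmax(axis=1)                 # best partner in b per a-token
    score = sim[np.arange(len(a)), best]
    order = np.argsort(-score)                # most redundant first
    merge_idx, keep_idx = order[:r], order[r:]
    b = b.copy()
    for i in merge_idx:                       # average into the partner
        b[best[i]] = 0.5 * (b[best[i]] + a[i])
    return np.concatenate([a[keep_idx], b], axis=0)
```

Dropping r tokens shrinks the quadratic attention cost at that layer, which is where the inference speedup comes from.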
These improvements result in better text-to-image generation quality, faster inference, and superior resolution extrapolation capabilities. Lumina-Next also demonstrates strong zero-shot multilingual generation using decoder-based LLMs as text encoders and can be extended to various modalities, including visual recognition, multi-view images, audio, music, and point cloud generation. The framework is released with all code and model weights to advance the development of next-generation generative AI.