5 Jun 2024 | Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao
Lumina-Next improves Lumina-T2X by enhancing generation performance and efficiency. It introduces Next-DiT with 3D RoPE and sandwich normalization, enabling better resolution extrapolation and multilingual generation. The framework also incorporates Frequency- and Time-Aware Scaled RoPE for improved resolution and detail preservation. Optimized time schedules and higher-order solvers reduce sampling steps, while Time-Aware Context Drop merges redundant tokens for faster inference. Lumina-Next demonstrates superior performance in text-to-image generation, multi-view, audio, music, and point cloud generation. It supports zero-shot multilingual generation using decoder-based LLMs as text encoders. The framework is versatile, capable of handling various modalities and resolutions. All code and model weights are available for further research and development.
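To illustrate the sandwich normalization mentioned above, here is a minimal sketch of a transformer block that applies a normalization layer both before and after each sublayer, the general pattern this term refers to. The class name, choice of LayerNorm, and layer sizes are assumptions for illustration only, not the actual Next-DiT implementation (which combines this with 3D RoPE and other components).

```python
# Minimal sketch of a "sandwich normalization" transformer block:
# each sublayer is wrapped in a norm before AND after, then added to the residual.
# All names and hyperparameters here are illustrative assumptions, not Lumina-Next code.
import torch
import torch.nn as nn


class SandwichNormBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Pre- and post-normalization around the attention sublayer.
        self.attn_pre_norm = nn.LayerNorm(dim)
        self.attn_post_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pre- and post-normalization around the feed-forward sublayer.
        self.mlp_pre_norm = nn.LayerNorm(dim)
        self.mlp_post_norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sublayer: normalize input, attend, normalize output, add residual.
        h = self.attn_pre_norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.attn_post_norm(h)
        # Feed-forward sublayer: same sandwich pattern.
        h = self.mlp_pre_norm(x)
        x = x + self.mlp_post_norm(self.mlp(h))
        return x


# Usage example on a dummy token sequence.
block = SandwichNormBlock(dim=256)
tokens = torch.randn(2, 64, 256)  # (batch, sequence, dim)
out = block(tokens)
print(out.shape)  # torch.Size([2, 64, 256])
```

The post-sublayer norm keeps activation magnitudes bounded as depth grows, which is the usual motivation for this pattern and is consistent with the paper's claim that it stabilizes training and extrapolation; the exact norm type and placement in Next-DiT may differ.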