Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

14 May 2024 | Zhimin Li*, Jianwei Zhang*, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyang Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Siuhan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu†
Hunyuan-DiT is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. The model uses a new network architecture based on diffusion transformers, combining a bilingual CLIP encoder and a multilingual T5 encoder to strengthen language understanding and extend context length. A data pipeline updates and evaluates training data for iterative model optimization, and a multimodal large language model refines image captions. Hunyuan-DiT can hold multi-turn multimodal dialogue with users, generating and refining images according to the conversational context. In a holistic human evaluation with over 50 professional evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation among open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT.

Hunyuan-DiT supports multi-resolution training and inference, which requires positional encodings appropriate to each resolution. Two types of positional encoding are used: Extended Positional Encoding and Centralized Interpolative Positional Encoding. The latter lets images of different resolutions share a similar positional encoding space, improving learning efficiency. To stabilize training, three techniques are applied: QK-Norm, normalization in the skip modules, and FP32 computation for certain sensitive operations.

A data pipeline covering data acquisition, interpretation, layering, and application handles data processing, with a category system spanning a wide range of subjects and styles. A 'data convoy' mechanism evaluates the impact of specialized data on the generative model. For fine-grained Chinese understanding, structural captions describe images comprehensively, and re-captioning with tag injection, combined with raw captions, improves data quality. A multi-turn dialogue system lets the model interactively modify its generations based on user input.
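The Centralized Interpolative Positional Encoding mentioned above maps each resolution's patch grid onto a shared coordinate range, so that grids of different sizes see similar position values. The function below is a minimal sketch of that idea; the base grid size and the exact interpolation scheme are illustrative assumptions, not the actual Hunyuan-DiT implementation.

```python
import numpy as np

def centered_interpolated_positions(height, width, base=64):
    """Map a (height x width) patch grid onto a shared coordinate range
    [0, base - 1] by symmetric interpolation, so different resolutions
    produce positions in the same span. `base` is a hypothetical
    reference grid size."""
    rows = np.linspace(0.0, base - 1.0, num=height)
    cols = np.linspace(0.0, base - 1.0, num=width)
    # Build a (height, width, 2) grid of (row, col) coordinates,
    # then flatten to (height * width, 2) for per-patch lookup.
    grid = np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1)
    return grid.reshape(-1, 2)
```

Because a 32x32 and a 64x64 grid both span the same coordinate range, the positional encodings computed from these coordinates overlap across resolutions, which is what makes the shared encoding space efficient to learn.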
The model is trained to understand multi-turn user dialogue and emit new text prompts for image generation. Evaluation covers four dimensions: text-image consistency, AI artifacts, subject clarity, and overall aesthetics, using a hierarchical dataset with several difficulty levels and a multi-person correction process. Hunyuan-DiT achieves state-of-the-art Chinese-to-image generation among open-source models on text-image consistency, subject clarity, and aesthetics, and performs comparably to top closed-source models on subject clarity and aesthetics. It supports long text understanding up to 256 tokens and accepts both Chinese and English text prompts.
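Among the stabilization techniques listed above, QK-Norm normalizes queries and keys before the attention dot product, which bounds the attention logits and prevents them from blowing up in mixed-precision training. The sketch below shows the general technique under assumed tensor shapes; it is a simplified illustration, not the exact Hunyuan-DiT code.

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with QK-Norm: L2-normalize queries and keys along the
    head dimension before computing logits. With unit-norm q and k, each
    logit lies in [-scale, scale], which stabilizes the softmax.
    Shapes: (..., seq, head_dim). A sketch, not the exact implementation."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    scale = q.shape[-1] ** -0.5
    logits = (q @ np.swapaxes(k, -2, -1)) * scale
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In practice QK-Norm is often paired with a learned temperature in place of the fixed `scale`, and, as the summary notes, the logit computation can additionally be kept in FP32 for stability.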
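The summary states that bilingual CLIP and multilingual T5 encoders are combined to extend the text context length. One common way to combine two text encoders, sketched below, is to project the T5 features to the CLIP feature width and concatenate along the sequence axis; the projection matrix, all shapes, and the 256-token total here are illustrative assumptions rather than confirmed details of the model.

```python
import numpy as np

def combine_text_embeddings(clip_emb, t5_emb, proj):
    """Concatenate CLIP and T5 token embeddings along the sequence axis
    after projecting T5 features to the CLIP width, yielding a longer
    combined text context for cross-attention.
    clip_emb: (seq_c, d), t5_emb: (seq_t, d_t5), proj: (d_t5, d).
    All names and shapes are hypothetical."""
    t5_aligned = t5_emb @ proj                        # match feature widths
    return np.concatenate([clip_emb, t5_aligned], axis=0)
```

The combined sequence is longer than either encoder's output alone, which is consistent with the summary's claim of supporting long text understanding up to 256 tokens.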