Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

14 May 2024 | Zhimin Li*, Jianwei Zhang*, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu†
Hunyuan-DiT is a powerful text-to-image diffusion transformer designed to understand both English and Chinese prompts. The model is built from a carefully designed transformer structure, text encoder, and positional encoding, and is supported by a comprehensive data pipeline for iterative model optimization. Hunyuan-DiT can generate high-quality, multi-resolution images grounded in fine-grained Chinese understanding, including ancient poetry, cuisine, and traditional styles. The model supports multi-turn dialogue, allowing users to refine images according to context. Through a holistic human evaluation protocol involving over 50 professional evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation, outperforming other open-source models in text-image consistency, subject clarity, and aesthetics. The code and pre-trained models are publicly available on GitHub.
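For readers who want to try the released checkpoints, a minimal text-to-image sketch is shown below. It assumes the Diffusers-format weights published as "Tencent-Hunyuan/HunyuanDiT-Diffusers" and a recent version of the diffusers library that ships a HunyuanDiTPipeline class; the prompt, dtype, and output filename are illustrative, not part of the paper.

```python
# Minimal sketch: generating an image from a Chinese prompt with Hunyuan-DiT.
# Assumes the Diffusers-format checkpoint "Tencent-Hunyuan/HunyuanDiT-Diffusers"
# and a diffusers release that includes HunyuanDiTPipeline.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# A Chinese prompt ("an astronaut riding a horse"), illustrating bilingual input.
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]
image.save("hunyuan_dit_sample.png")
```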