Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

14 May 2024 | Zhimin Li*, Jianwei Zhang*, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, Qinglin Lu†
Hunyuan-DiT is a powerful text-to-image diffusion transformer designed to understand both English and Chinese prompts. The model is built from a carefully designed transformer structure, text encoder, and positional encoding, and is supported by a comprehensive data pipeline for iterative model optimization. Hunyuan-DiT can generate high-quality, multi-resolution images grounded in fine-grained Chinese understanding, including ancient poetry, cuisine, and traditional styles. The model supports multi-turn dialogue, allowing users to refine images according to context. Through a holistic human evaluation protocol involving over 50 professional evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation, outperforming other open-source models in text-image consistency, subject clarity, and aesthetics. The code and pre-trained models are publicly available on GitHub.
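For readers who want to try the released checkpoints, a minimal text-to-image sketch is shown below. It assumes the Diffusers-format weights published as "Tencent-Hunyuan/HunyuanDiT-Diffusers" and a recent version of the diffusers library that ships a HunyuanDiTPipeline class; the prompt, dtype, and output filename are illustrative, not part of the paper.

```python
# Minimal sketch: generating an image from a Chinese prompt with Hunyuan-DiT.
# Assumes the Diffusers-format checkpoint "Tencent-Hunyuan/HunyuanDiT-Diffusers"
# and a diffusers release that includes HunyuanDiTPipeline.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# A Chinese prompt ("an astronaut riding a horse"), illustrating bilingual input.
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]
image.save("hunyuan_dit_sample.png")
```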