CosmicMan: A Text-to-Image Foundation Model for Humans

CosmicMan is a text-to-image foundation model specialized for generating high-fidelity human images. It targets two weaknesses of general-purpose models in human generation: inferior image quality and text-image misalignment. The work introduces a data production paradigm, Annotate Anyone, which continuously produces high-quality data with accurate yet cost-effective annotations. This paradigm yields CosmicMan-HQ 1.0, a large-scale dataset with 6 million high-resolution human images and 115 million detailed attributes.
The authors also propose a training framework called Daring, which decomposes text descriptions and image pixels to enhance text-image alignment. Daring uses a new loss function, HOLA, to enforce attention refocusing in specific spatial regions related to human body structure and outfit arrangement. The model is reported to outperform state-of-the-art text-to-image models in image quality and text-image alignment, and it is shown to be effective in 2D and 3D human generation tasks. CosmicMan-HQ is released for research use, providing a foundation for human-centric content generation. The model is designed to be easily integrated into downstream tasks and to support long-term research through continuous updates. The framework and dataset are expected to advance the field of human-centric image generation.
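The idea of "attention refocusing in specific spatial regions" can be pictured with a small, generic sketch. It is not the HOLA formulation, which this summary does not give. It only illustrates one plausible kind of penalty, assuming a 2D attention map for a text phrase and a binary mask marking the body or outfit region that phrase should attend to. The function name `refocus_penalty` and both inputs are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def refocus_penalty(attention: Grid, region_mask: Grid) -> float:
    """Fraction of attention mass that falls outside the target region.

    Hypothetical illustration only, not the paper's HOLA loss.
    `attention` holds non-negative attention weights per spatial cell;
    `region_mask` holds 1.0 inside the target region and 0.0 elsewhere.
    Returns 0.0 when all attention is inside the region, 1.0 when none is.
    """
    total = 0.0
    inside = 0.0
    for attn_row, mask_row in zip(attention, region_mask):
        for weight, in_region in zip(attn_row, mask_row):
            total += weight
            inside += weight * in_region
    if total == 0.0:
        raise ValueError("attention map has no mass")
    return 1.0 - inside / total


# Toy 2x2 example: the right column is the target region.
attention = [[0.125, 0.375],
             [0.125, 0.375]]
right_column = [[0.0, 1.0],
                [0.0, 1.0]]
penalty = refocus_penalty(attention, right_column)
```

Minimizing a term like this during training would push attention for the phrase toward its region. The actual loss used by Daring may differ in form and in how regions are obtained.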

1 Apr 2024 | Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin†, Wayne Wu†
