CosmicMan: A Text-to-Image Foundation Model for Humans

1 Apr 2024 | Shikai Li*, Jianglin Fu*, Kaiyuan Liu*, Wentao Wang*, Kwan-Yee Lin†, Wayne Wu†
**Authors:** Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu

**Institution:** Shanghai AI Laboratory

**Abstract:** CosmicMan is a specialized text-to-image foundation model designed to generate high-fidelity human images with meticulous appearance, reasonable structure, and precise text-image alignment. Unlike general-purpose models, which struggle with inferior quality and text-image misalignment, CosmicMan addresses these issues with a new data production paradigm, Annotate Anyone, and a decomposed training framework called Daring.

**Key Contributions:**

1. **Annotate Anyone:** A new data production paradigm that relies on human-AI cooperation to continuously produce high-quality, cost-effective data. It has two main stages: Flowing Data Sourcing and Human-in-the-loop Data Annotation.
2. **CosmicMan-HQ 1.0:** A large-scale dataset of 6 million high-quality real-world human images (mean resolution: 1488 × 1255) with rich annotations (115 million attributes).
3. **Daring (Decomposed-Attention-Refocusing):** A training framework that decomposes dense text descriptions into groups aligned with human body structure and enforces attention refocusing without adding extra modules.

**Methods:**

- **Annotate Anyone:** Combines AI and human expertise to continuously produce high-quality, diverse human images.
- **CosmicMan-HQ 1.0:** Largest human-centric dataset with rich annotations, including 115 million attributes.
- **Daring:** Enhances text-image alignment by explicitly discretizing dense text descriptions and enforcing attention refocusing.

**Experiments:**

- **Quantitative Evaluation:** CosmicMan outperforms state-of-the-art models in image quality and fine-grained text-image alignment.
- **Human Preference Evaluation:** Users prefer CosmicMan's results in both image quality and text-image alignment.
- **Ablation Study:** Demonstrates the effectiveness of the proposed dataset and training strategy.
- **Applications:** Shows superior performance in 2D human editing and 3D human reconstruction tasks.

**Discussion:**

- **Release and Future Work:** The team plans to continuously update the dataset and provide periodic releases of the model to support human-centric content generation research.

**Conclusion:** CosmicMan addresses the gap in human-centric content generation by providing a high-quality, specialized foundation model that generates realistic human images with precise text-image alignment.
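**Illustrative sketch (not from the paper):** Daring is described above as decomposing dense text descriptions into groups aligned with human body structure. The Python snippet below is a minimal sketch of what such a grouping step could look like, assuming a comma-separated caption and a simple keyword lookup. The group names, the keyword lists, and the `decompose_caption` helper are all hypothetical and are not taken from the CosmicMan implementation. The attention-refocusing part of Daring is not modeled here.

```python
from collections import defaultdict

# Hypothetical mapping from attribute keywords to body-structure groups.
# Group names and keywords are illustrative assumptions, not the paper's taxonomy.
BODY_GROUPS = {
    "head": ("hair", "face", "hat", "glasses"),
    "upper body": ("shirt", "jacket", "coat", "sweater"),
    "lower body": ("pants", "skirt", "jeans", "shorts"),
    "feet": ("shoes", "boots", "sneakers"),
}


def decompose_caption(caption: str) -> dict:
    """Split a comma-separated dense caption into per-body-part phrase groups."""
    groups = defaultdict(list)
    for phrase in (p.strip() for p in caption.split(",")):
        if not phrase:
            continue
        words = phrase.lower().split()
        for group, keywords in BODY_GROUPS.items():
            # Assign the phrase to the first group whose keyword it mentions.
            if any(keyword in words for keyword in keywords):
                groups[group].append(phrase)
                break
        else:
            # Phrases with no body-part keyword describe the person as a whole.
            groups["whole body"].append(phrase)
    return dict(groups)


caption = "a young woman, long black hair, red jacket, blue jeans, white sneakers"
grouped = decompose_caption(caption)
for group, phrases in grouped.items():
    print(f"{group}: {phrases}")
```

Each resulting group could then be paired with the image region for the matching body part during training. How that pairing is enforced is specific to the paper's method and is outside the scope of this sketch.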