**CosmicMan: A Text-to-Image Foundation Model for Humans**
**Authors:** Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu
**Institution:** Shanghai AI Laboratory
**Abstract:**
CosmicMan is a specialized text-to-image foundation model designed to generate high-fidelity human images with meticulous appearance, reasonable structure, and precise text-image alignment. General-purpose models often produce human images of inferior quality that are poorly aligned with the prompt; CosmicMan addresses both issues with a new data production paradigm, Annotate Anyone, and a decomposed training framework called Daring.
**Key Contributions:**
1. **Annotate Anyone:** A new data production paradigm that relies on human-AI cooperation to continuously produce high-quality, cost-effective data. It has two main stages: Flowing Data Sourcing and Human-in-the-loop Data Annotation.
2. **CosmicMan-HQ 1.0:** A large-scale dataset with 6 million high-quality real-world human images (mean resolution: 1488 × 1255) and rich annotations (115 million attributes).
3. **Daring (Decomposed-Attention-Refocusing):** A training framework that decomposes dense text descriptions into groups aligned with human body structure and enforces attention refocusing without adding extra modules.
**Methods:**
- **Annotate Anyone:** Combines AI and human expertise to continuously produce high-quality, diverse human images.
- **CosmicMan-HQ 1.0:** Largest human-centric dataset with rich annotations, including 115 million attributes.
- **Daring:** Enhances text-image alignment by explicitly discretizing dense text descriptions and enforcing attention refocusing.
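
The decomposition step in Daring can be illustrated with a minimal sketch. It only shows the idea of splitting a dense description into groups that follow human body structure; it does not implement attention refocusing, and it is not the authors' code. The group names, the `GROUP_OF` mapping, and the functions `decompose` and `to_group_prompts` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from annotated attribute keys to body-structure groups.
# These keys and group names are illustrative assumptions, not the paper's taxonomy.
GROUP_OF = {
    "hair": "head",
    "face": "head",
    "top": "upper_body",
    "outerwear": "upper_body",
    "pants": "lower_body",
    "skirt": "lower_body",
    "shoes": "feet",
}


def decompose(attributes):
    """Group (part, description) pairs by body region, keeping input order."""
    groups = defaultdict(list)
    for part, description in attributes:
        # Attributes with no known region fall back to a whole-body group.
        groups[GROUP_OF.get(part, "whole_body")].append(description)
    return dict(groups)


def to_group_prompts(groups):
    """Join each group's descriptions into one short text fragment."""
    return {name: ", ".join(descs) for name, descs in groups.items()}


attrs = [
    ("hair", "long black hair"),
    ("top", "white linen shirt"),
    ("pants", "blue jeans"),
    ("shoes", "brown leather boots"),
    ("outerwear", "grey wool coat"),
    ("pose", "standing"),
]
groups = decompose(attrs)
prompts = to_group_prompts(groups)
for name, text in prompts.items():
    print(f"{name}: {text}")
```

In a training setup along these lines, each fragment would be tied to the image region of its body part, so that the attention for "brown leather boots" can be encouraged to concentrate on the feet rather than spread over the whole figure.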
**Experiments:**
- **Quantitative Evaluation:** CosmicMan outperforms state-of-the-art models in image quality and fine-grained text-image alignment.
- **Human Preference Evaluation:** Users prefer CosmicMan's results in both image quality and text-image alignment.
- **Ablation Study:** Demonstrates the effectiveness of the proposed dataset and training strategy.
- **Applications:** Shows superior performance in 2D human editing and 3D human reconstruction tasks.
**Discussion:**
- **Release and Future Work:** The team plans to continuously update the dataset and provide periodic releases of the model to support human-centric content generation research.
**Conclusion:**
CosmicMan addresses the gap in human-centric content generation by providing a high-quality, specialized foundation model that generates realistic human images with precise text-image alignment.