**ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance**
**Authors:** Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, Ziwei Liu
**Date:** 19 Mar 2024
**Abstract:**
Generating high-quality 3D assets from a single image is crucial for applications such as AR/VR. Recent advances in single-image 3D generation have focused on feed-forward models that infer a 3D model without per-instance optimization; however, these methods struggle with complex 3D assets containing multiple objects. This paper introduces ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple object-level models. The key contributions are:
1. **Analysis of the "Multi-Object Gap":** The authors analyze the limitations of existing feed-forward models from both model and data perspectives, identifying biases in camera settings, dataset composition, and occlusion handling.
2. **Single-Object Reconstruction:** Each object in the input image is segmented, its occluded regions are completed via inpainting, and it is then reconstructed individually, yielding accurate per-object geometry and texture.
3. **Multi-Object Combination:** Pre-trained diffusion models are used to guide the positioning of objects, focusing on spatial alignment rather than content matching. This is achieved through spatially-aware score distillation sampling (SSDS), which emphasizes the spatial relationships between objects.
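For context, SSDS keeps the standard score distillation sampling (SDS) gradient over the scene parameters (here, each object's scale, rotation, and translation) but computes the noise prediction with the cross-attention of spatial tokens upweighted. The formulation below uses the common SDS notation; the reweighting symbol $c$ is our shorthand rather than necessarily the paper's.

```latex
% Standard SDS gradient over scene parameters \theta
% (for ComboVerse, \theta are the per-object scale, rotation, and translation):
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta} \right]

% SSDS keeps the same form, but inside the denoiser \hat{\epsilon}_{\phi} the cross-attention
% map of every spatial token s (e.g., "on", "above") is scaled by a constant c > 1:
A'_{s} = c \cdot A_{s}
```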
**Methods:**
- **Object Inpainting:** Occluded regions of each segmented object are completed with Stable Diffusion inpainting, using a bounding-aware mask to avoid artifacts (see the inpainting sketch after this list).
- **Spatially-Aware Diffusion Guidance:** The proposed SSDS loss reweights the cross-attention maps of spatial tokens, improving the accuracy of object placement (see the SSDS sketch below).
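The following is a minimal sketch of the object-inpainting step, assuming the Hugging Face `diffusers` library and an off-the-shelf inpainting checkpoint; the `bounding_aware_mask` heuristic is an illustrative guess at the idea, not the paper's exact recipe.

```python
# Hedged sketch: complete the occluded region of one segmented object with
# Stable Diffusion inpainting. Checkpoint and mask heuristic are assumptions.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def bounding_aware_mask(object_mask: np.ndarray, occluder_boxes, pad: int = 8) -> Image.Image:
    """White (255) pixels are repainted: the padded bounding boxes of occluding
    objects, minus the target object's visible pixels (illustrative heuristic)."""
    mask = np.zeros_like(object_mask, dtype=np.uint8)
    for x0, y0, x1, y1 in occluder_boxes:
        mask[max(0, y0 - pad):y1 + pad, max(0, x0 - pad):x1 + pad] = 255
    mask[object_mask > 0] = 0  # keep the pixels we can already see
    return Image.fromarray(mask)

def complete_object(crop: Image.Image, mask: Image.Image, prompt: str) -> Image.Image:
    # The pipeline fills only the masked region, preserving visible texture.
    return pipe(prompt=prompt, image=crop, mask_image=mask).images[0]
```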
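At the heart of SSDS is the attention reweighting itself: before computing the noise prediction used for score distillation, the cross-attention assigned to spatial tokens is boosted so the gradient emphasizes layout over content. The snippet below is a self-contained sketch of just that reweighting; the scale factor and token indices are illustrative assumptions, and in the full method the boosted map would replace the original one inside the denoiser's cross-attention layers.

```python
# Self-contained sketch of spatially-aware attention reweighting (the core of SSDS).
# The scale factor and token indices are illustrative, not the authors' settings.
import torch

def reweight_spatial_attention(attn: torch.Tensor,
                               spatial_token_ids: list[int],
                               scale: float = 5.0) -> torch.Tensor:
    """attn: cross-attention map of shape (heads, image_tokens, text_tokens).
    Boosts the columns of spatial tokens (e.g., "on", "above") by `scale`,
    then renormalizes each row to sum to 1."""
    attn = attn.clone()
    attn[..., spatial_token_ids] *= scale
    return attn / attn.sum(dim=-1, keepdim=True)

# Toy usage: 8 heads, 64 image tokens, 10 text tokens; token 3 is the spatial word.
attn = torch.softmax(torch.randn(8, 64, 10), dim=-1)
boosted = reweight_spatial_attention(attn, spatial_token_ids=[3])
```

The resulting noise prediction then drives an SDS-style gradient over each object's scale, rotation, and translation, as in the formulation given above.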
**Experiments:**
- **Benchmark:** A benchmark of 100 images covering diverse complex scenes is used to evaluate ComboVerse.
- **Comparison:** ComboVerse outperforms existing methods in terms of semantic similarity and user preference.
- **Ablation Study:** The effectiveness of object inpainting and spatially-aware diffusion guidance is demonstrated through ablation studies.
**Conclusion:**
ComboVerse addresses the "multi-object gap" by combining object-level 3D generative models with spatially-aware guidance, achieving high-quality 3D asset creation. The method is particularly effective for scenes with multiple objects and complex occlusions.