ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

19 Mar 2024 | Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu
ComboVerse is a novel framework for generating high-quality 3D assets with complex compositions from a single image. It addresses the "multi-object gap" in existing image-to-3D methods, which struggle to produce accurate 3D models of scenes containing multiple objects. The framework operates in two stages: single-object reconstruction and multi-object combination. In the first stage, each object is decomposed from the input image and reconstructed individually using occlusion removal and image-to-3D modeling. In the second stage, the reconstructed 3D models are combined by adjusting their sizes, rotations, and positions to match the input image. This placement is guided by spatially-aware score distillation sampling (SSDS), which emphasizes spatial relationships between objects and improves placement accuracy over standard score distillation sampling.

Extensive experiments show that ComboVerse achieves significant improvements in generating compositional 3D assets compared to existing methods. The framework is evaluated on a benchmark of 100 images with diverse complex scenes, demonstrating its effectiveness in handling multiple objects, occlusion, and varied camera settings. The main contributions are: proposing ComboVerse, analyzing the "multi-object gap" from both model and data perspectives, and introducing spatially-aware diffusion guidance for object placement. The method outperforms existing approaches in both qualitative and quantitative evaluations, showing superior performance in generating realistic 3D assets.
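To make the second stage concrete, the combination step can be viewed as placing each reconstructed object into a shared scene with a per-object similarity transform (scale, rotation, translation). The sketch below illustrates only that parameterization with point sets standing in for meshes; the function and parameter names are illustrative assumptions, and the transform values would in practice be optimized under SSDS guidance rather than given directly.

```python
import numpy as np

def compose_objects(objects, params):
    """Place reconstructed single-object shapes (here: Nx3 point sets)
    into one scene by applying a per-object similarity transform.

    params[i] = (scale, yaw, translation) is a stand-in for the size,
    rotation, and position values that spatially-aware SDS would adjust;
    a single yaw angle about the y-axis is an illustrative simplification
    of a full 3D rotation.
    """
    placed = []
    for pts, (s, yaw, t) in zip(objects, params):
        c, si = np.cos(yaw), np.sin(yaw)
        # Rotation about the vertical (y) axis.
        R = np.array([[c, 0.0, si],
                      [0.0, 1.0, 0.0],
                      [-si, 0.0, c]])
        # Scale, rotate, then translate each object's points.
        placed.append(s * pts @ R.T + np.asarray(t, dtype=float))
    return placed
```

In an optimization loop, `s`, `yaw`, and `t` for each object would be treated as learnable parameters and updated by gradients of the SSDS loss so that renderings of the composed scene match the spatial layout of the input image.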