ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance

19 Mar 2024 | Yongwei Chen, Tengfei Wang, Tong Wu, Xingang Pan, Kui Jia, and Ziwei Liu
ComboVerse is a novel framework for generating high-quality 3D assets with complex compositions from a single image. It addresses the "multi-object gap" in existing image-to-3D methods, which struggle to produce accurate 3D models of scenes containing multiple objects. The framework operates in two stages: single-object reconstruction and multi-object combination. In the first stage, each object is decomposed from the input image and reconstructed individually using occlusion removal and image-to-3D modeling. In the second stage, the reconstructed 3D models are combined by adjusting their sizes, rotations, and positions to match the input image. This placement is guided by spatially-aware score distillation sampling (SSDS), which emphasizes spatial relationships between objects and improves placement accuracy over standard score distillation sampling.

Extensive experiments show that ComboVerse achieves significant improvements in generating compositional 3D assets compared to existing methods. The framework is evaluated on a benchmark of 100 images with diverse complex scenes, demonstrating its effectiveness in handling multiple objects, occlusion, and varied camera settings. The main contributions are: proposing ComboVerse, analyzing the "multi-object gap" from both model and data perspectives, and introducing spatially-aware diffusion guidance for object placement. The method outperforms existing approaches in both qualitative and quantitative evaluations, showing superior performance in generating realistic 3D assets.
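To make the second stage concrete, the combination step can be viewed as placing each reconstructed object into a shared scene with a per-object similarity transform (scale, rotation, translation). The sketch below illustrates only that parameterization with point sets standing in for meshes; the function and parameter names are illustrative assumptions, and the transform values would in practice be optimized under SSDS guidance rather than given directly.

```python
import numpy as np

def compose_objects(objects, params):
    """Place reconstructed single-object shapes (here: Nx3 point sets)
    into one scene by applying a per-object similarity transform.

    params[i] = (scale, yaw, translation) is a stand-in for the size,
    rotation, and position values that spatially-aware SDS would adjust;
    a single yaw angle about the y-axis is an illustrative simplification
    of a full 3D rotation.
    """
    placed = []
    for pts, (s, yaw, t) in zip(objects, params):
        c, si = np.cos(yaw), np.sin(yaw)
        # Rotation about the vertical (y) axis.
        R = np.array([[c, 0.0, si],
                      [0.0, 1.0, 0.0],
                      [-si, 0.0, c]])
        # Scale, rotate, then translate each object's points.
        placed.append(s * pts @ R.T + np.asarray(t, dtype=float))
    return placed
```

In an optimization loop, `s`, `yaw`, and `t` for each object would be treated as learnable parameters and updated by gradients of the SSDS loss so that renderings of the composed scene match the spatial layout of the input image.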