MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

4 Apr 2024 | Hanzhe Hu1*, Zhizhuo Zhou2*, Varun Jampani3, Shubham Tulsiani1
MVD-Fusion is a method for single-view 3D inference that generates multi-view RGB-D images using a depth-guided attention mechanism to enforce multi-view consistency. Unlike recent methods that rely on distillation processes to generate 3D outputs from novel-view generations, MVD-Fusion directly models the joint distribution over multiple views by training a denoising diffusion model. The model leverages intermediate noisy depth estimates to maintain multi-view consistency through reprojection-based conditioning. The approach is trained on large-scale synthetic data (Objaverse) and real-world data (CO3D) and demonstrates superior accuracy and diversity compared to state-of-the-art methods. MVD-Fusion also provides a more accurate representation of the geometry induced by the synthesized depth images, making it a complementary advance to existing techniques. The method is evaluated on various datasets, including Objaverse, Google Scanned Objects, and CO3D, showing consistent improvements in novel view synthesis and 3D reconstruction tasks.
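The reprojection-based conditioning described above relies on standard depth-based cross-view warping: pixels in one view are back-projected into 3D using their (possibly noisy) depth values, transformed into another camera's frame, and projected into that view, yielding correspondences that can guide attention across views. The sketch below illustrates this geometric step only; it is not the authors' implementation, and the function name, intrinsics, and image size are illustrative assumptions.

```python
import numpy as np

def reproject_depth(depth, K, T_src_to_tgt):
    """Map every source pixel to its (u, v) location in a target view.

    depth:        (H, W) per-pixel depth in the source camera.
    K:            (3, 3) shared pinhole intrinsics (assumed here).
    T_src_to_tgt: (4, 4) rigid transform from source to target camera.
    Returns an (H, W, 2) array of target-image pixel coordinates,
    usable to warp source features for cross-view conditioning.
    """
    H, W = depth.shape
    # Homogeneous pixel grid, one column per source pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project to 3D points in the source camera frame.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)

    # Rigidly transform into the target camera frame.
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    tgt = (T_src_to_tgt @ cam_h)[:3]

    # Perspective-project into the target image plane.
    proj = K @ tgt
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)
    return uv.T.reshape(H, W, 2)

# Sanity check: with an identity transform, each pixel maps to itself.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
uv = reproject_depth(depth, K, np.eye(4))
```

In MVD-Fusion this kind of correspondence is computed from intermediate noisy depth estimates during denoising, so the conditioning tolerates imperfect depth rather than requiring a finished geometry.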