MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

4 Apr 2024 | Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, Shubham Tulsiani
MVD-Fusion is a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. Unlike previous approaches that require a distillation process to obtain 3D outputs, MVD-Fusion directly generates a set of mutually consistent views, using depth as the mechanism for enforcing consistency across them. Concretely, a denoising diffusion model generates multi-view RGB-D images given a single RGB input, and the intermediate noisy depth estimates are used to derive reprojection-based conditioning across views; a depth-guided attention mechanism then ensures that the generated images are geometrically consistent. The model is trained on the large-scale synthetic Objaverse dataset and the real-world CO3D dataset, which also allows it to handle arbitrary perspective camera poses at inference time.
Evaluated on both synthetic and real-world objects, MVD-Fusion achieves more accurate view synthesis than recent state-of-the-art methods such as Zero-1-to-3 and SyncDreamer, producing more plausible outputs that remain faithful to details in the input, and it yields better 3D predictions than prior direct 3D inference methods. The improvements over these baselines are consistent across metrics, including SSIM and the perceptual LPIPS metric, on both the in-distribution Objaverse dataset and the out-of-distribution Google Scanned Objects (GSO) dataset. The generated multi-view images align well with the input and contain plausible completions in unobserved regions, preserving the rich texture of the input while capturing the rough geometry without post-processing. The method also produces diverse samples for the same input, generates accurate and realistic novel views on real-world data such as CO3D with arbitrary perspective camera poses, and generalizes to challenging in-the-wild and out-of-domain images, from which it still recovers consistent novel views and reasonable 3D shapes.