MVD-Fusion is a method for single-view 3D inference that generates multi-view RGB-D images, using a depth-guided attention mechanism to enforce multi-view consistency. Unlike recent methods that rely on distillation to obtain 3D outputs from novel-view generations, MVD-Fusion directly models the joint distribution over multiple views by training a denoising diffusion model. The model leverages intermediate noisy depth estimates to maintain multi-view consistency via reprojection-based conditioning. Trained on a large-scale synthetic dataset (Objaverse) and a real-world dataset (CO3D), the approach demonstrates superior accuracy and diversity compared to state-of-the-art methods. MVD-Fusion also yields a more accurate representation of geometry through the synthesized depth images, making it a complementary advance to existing techniques. The method is evaluated on Objaverse, Google Scanned Objects, and CO3D, showing consistent improvements in novel view synthesis and 3D reconstruction.
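The reprojection-based conditioning can be pictured as a geometric warp: a (noisy) depth estimate for one view unprojects its pixels to 3D, those points are projected into another view, and that view's features are sampled at the projected locations to condition the denoiser. The sketch below illustrates this idea only; it is not the authors' implementation, and all names, shapes, and choices (shared intrinsics `K`, pose `R, t`, bilinear sampling of a feature map) are assumptions for illustration.

```python
# Minimal sketch of reprojection-based conditioning (illustrative, not MVD-Fusion's code).
import torch
import torch.nn.functional as F

def reproject_features(depth_src, feats_tgt, K, R, t):
    """depth_src: (H, W) noisy depth for the source view.
    feats_tgt: (C, H, W) features of the target view.
    K: (3, 3) shared intrinsics; R: (3, 3), t: (3,) source-to-target pose.
    Returns (C, H, W): target features warped into the source view."""
    H, W = depth_src.shape
    # Pixel grid in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)   # (3, H*W)
    # Unproject with the (noisy) depth, then transform into the target frame.
    cam_pts = torch.linalg.inv(K) @ pix * depth_src.reshape(1, -1)           # (3, H*W)
    tgt_pts = R @ cam_pts + t.reshape(3, 1)                                  # (3, H*W)
    # Project into the target image plane.
    proj = K @ tgt_pts
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                                 # (2, H*W)
    # Normalize to [-1, 1] and sample target features at the projected pixels.
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                        uv[1] / (H - 1) * 2 - 1], dim=-1).reshape(1, H, W, 2)
    warped = F.grid_sample(feats_tgt.unsqueeze(0), grid,
                           mode="bilinear", align_corners=True)
    return warped.squeeze(0)                                                 # (C, H, W)

# Toy usage: condition a source view on features reprojected from one other view.
depth = torch.rand(64, 64) * 2 + 1                  # hypothetical noisy depth estimate
feats = torch.randn(32, 64, 64)                     # hypothetical target-view features
K = torch.tensor([[64.0, 0, 32], [0, 64.0, 32], [0, 0, 1.0]])
R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])
cond = reproject_features(depth, feats, K, R, t)    # (32, 64, 64) conditioning signal
```

In a multi-view denoising setup, a warp of this kind could be applied between every pair of views so that each view's denoising step attends to features that are geometrically aligned with it, which is one plausible way the depth-guided attention enforces consistency.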