MultiDiff: Consistent Novel View Synthesis from a Single Image

26 Jun 2024 | Norman Müller, Katja Schwarz, Barbara Roessler, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, Peter Kontschieder
MultiDiff is a novel approach for consistent novel view synthesis from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed, since unobserved regions admit many plausible explanations. To address this, MultiDiff incorporates strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables the model to condition on warped reference images for the target views, improving geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. Unlike autoregressive methods that are prone to error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results even for long-term scene generation with large camera movements. Additionally, MultiDiff introduces a structured noise distribution to further improve multi-view consistency. Experimental results on the RealEstate10K and ScanNet datasets show that MultiDiff outperforms state-of-the-art methods in both image quality and consistency. The model also supports multi-view consistent editing without additional tuning.
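To make the depth-based conditioning concrete, the sketch below shows one common way to forward-warp a reference image into a target view given a monocular depth estimate, camera intrinsics, and a relative pose. This is a minimal illustration of the kind of warped-reference signal described above, not the authors' implementation; the function name, arguments, and nearest-pixel splatting strategy are assumptions for clarity.

```python
# Hypothetical sketch (not the authors' code): forward-warping a reference
# image into a target view using a monocular depth estimate, to illustrate
# the depth-warped conditioning signal described in the summary.
import numpy as np

def warp_reference_to_target(image, depth, K, R, t):
    """Warp `image` (H, W, 3) with per-pixel `depth` (H, W) from the reference
    camera into a target camera, given intrinsics K (3, 3) and the relative
    pose (R, t) mapping reference-camera points to target-camera points.
    Returns the warped image and a validity mask for disoccluded regions."""
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject pixels to 3D points in the reference camera frame.
    pts_ref = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)

    # Transform into the target camera frame and project back to pixels.
    pts_tgt = R @ pts_ref + t.reshape(3, 1)
    proj = K @ pts_tgt
    z = proj[2]
    uv = np.round(proj[:2] / np.clip(z, 1e-6, None)).astype(int)

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    valid = (z > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)

    src_colors = image.reshape(-1, 3)[valid]
    tu, tv = uv[0, valid], uv[1, valid]
    # Nearest-pixel splatting; a full pipeline would also resolve occlusions
    # by depth ordering and fill holes, which the diffusion model then inpaints.
    warped[tv, tu] = src_colors
    mask[tv, tu] = True
    return warped, mask
```

The returned mask marks pixels that received content from the reference view; the unmarked (disoccluded) regions are exactly the areas the generative prior must hallucinate consistently across frames.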
Understanding MultiDiff: Consistent Novel View Synthesis from a Single Image