MultiDiff: Consistent Novel View Synthesis from a Single Image

26 Jun 2024 | Norman Müller¹, Katja Schwarz¹, Barbara Roessle², Lorenzo Porzi¹, Samuel Rota Bulò¹, Matthias Nießner², Peter Kontschieder¹
MultiDiff is a novel approach for consistent novel view synthesis from a single image. The method leverages strong priors, namely monocular depth predictors and video diffusion models, to generate consistent, high-quality views along a desired camera trajectory. By incorporating geometric stability through warped reference images and a structured noise distribution, MultiDiff achieves multi-view consistency and high image quality.

The model is trained to synthesize sequences of frames, reducing inference time by an order of magnitude and enabling natural multi-view consistent editing without further tuning. Experimental results on the RealEstate10K and ScanNet datasets show that MultiDiff outperforms state-of-the-art methods in both image fidelity and consistency. The model supports long-term scene generation with large camera movements and produces more realistic, view-consistent results than the baselines. Ablation studies further highlight the importance of the priors and the structured noise for consistent synthesis.

MultiDiff generates novel views from a single input image without requiring any geometric information about the target views. Its ability to stay consistent across large viewpoint changes, combined with its efficient inference, makes it a promising solution for view extrapolation from a single image.
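To make the warping prior described above more concrete, the sketch below forward-warps a reference image into a target view using a monocular depth map and a relative camera pose. This is a generic depth-based reprojection in NumPy, not MultiDiff's actual implementation: the function name `warp_reference`, the shared-intrinsics assumption, and the nearest-pixel z-buffer splatting are illustrative choices made here for brevity.

```python
import numpy as np

def warp_reference(image, depth, K, T_ref_to_tgt):
    """Forward-warp a reference image into a target view using per-pixel depth.

    image:        (H, W, 3) reference RGB image
    depth:        (H, W) monocular depth prediction for the reference view
    K:            (3, 3) camera intrinsics (assumed shared by both views)
    T_ref_to_tgt: (4, 4) relative camera pose from reference to target
    """
    H, W = depth.shape

    # Pixel grid in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to 3D points in the reference camera frame.
    rays = (np.linalg.inv(K) @ pix.T).T          # (H*W, 3) viewing rays
    pts_ref = rays * depth.reshape(-1, 1)        # scale rays by predicted depth

    # Transform the points into the target camera frame.
    pts_h = np.concatenate([pts_ref, np.ones((pts_ref.shape[0], 1))], axis=1)
    pts_tgt = (T_ref_to_tgt @ pts_h.T).T[:, :3]

    # Project into the target image plane.
    proj = (K @ pts_tgt.T).T
    z = proj[:, 2:3]
    uv = proj[:, :2] / np.clip(z, 1e-6, None)

    # Forward splat with a z-buffer so nearer points win.
    warped = np.zeros_like(image, dtype=np.float64)
    zbuf = np.full((H, W), np.inf)
    ui = np.round(uv[:, 0]).astype(int)
    vi = np.round(uv[:, 1]).astype(int)
    valid = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H) & (z[:, 0] > 0)
    src = image.reshape(-1, 3)
    for idx in np.flatnonzero(valid):
        if z[idx, 0] < zbuf[vi[idx], ui[idx]]:
            zbuf[vi[idx], ui[idx]] = z[idx, 0]
            warped[vi[idx], ui[idx]] = src[idx]

    # Pixels with no source correspondence (disocclusions) remain zero.
    return warped
```

Pixels that have no source correspondence stay empty in the warped image; in the approach summarized above, hallucinating plausible content for exactly these disoccluded regions is what the video-diffusion prior is responsible for.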