11 Mar 2024 | Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu
V3D: Video Diffusion Models are Effective 3D Generators
**Abstract:**
Automatic 3D generation has gained significant attention; recent methods accelerate generation but often produce less detailed objects due to limited model capacity or scarce 3D data. Inspired by advancements in video diffusion models, V3D leverages the world-simulation capacity of pre-trained video diffusion models to enhance 3D generation. By introducing geometrical consistency priors and extending the video diffusion model into a multi-view consistent 3D generator, V3D can generate high-fidelity 3D objects within 3 minutes. The method also extends to scene-level novel view synthesis, achieving precise control over camera paths given sparse input views. Extensive experiments demonstrate superior performance in generation quality and multi-view consistency.
**Keywords:**
Video Diffusion Models · Single Image to 3D · Novel View Synthesis
**Introduction:**
Recent advancements in video diffusion models have enabled the generation of intricate scenes and complex dynamics with strong spatio-temporal consistency. V3D exploits these models to generate consistent multi-view images, which are then used for 3D reconstruction. For object-centric 3D generation, V3D fine-tunes a base video diffusion model on 360° orbit videos of synthetic 3D objects, producing multi-view images reliable enough for reconstruction. For scene-level novel view synthesis, V3D integrates a PixelNeRF encoder to control camera poses and accommodate multiple input images. The method achieves state-of-the-art performance in both object-centric and scene-level 3D generation, demonstrating its effectiveness and efficiency.
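Since the fine-tuning data consists of 360° orbit renderings, a concrete building block is generating camera poses evenly spaced on such an orbit. Below is a minimal NumPy sketch (y-up look-at convention); the frame count, radius, and elevation are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def orbit_camera_poses(n_views=18, radius=2.0, elevation_deg=0.0):
    """Camera-to-world poses evenly spaced on a 360-degree orbit,
    each looking at the origin (y-up, OpenGL-style convention).
    All parameter values here are illustrative assumptions."""
    poses = []
    elev = np.deg2rad(elevation_deg)
    for i in range(n_views):
        azim = 2 * np.pi * i / n_views
        # Camera position on the orbit.
        cam = radius * np.array([
            np.cos(elev) * np.sin(azim),
            np.sin(elev),
            np.cos(elev) * np.cos(azim),
        ])
        # Orthonormal look-at frame pointing at the origin.
        forward = -cam / np.linalg.norm(cam)
        right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
        right /= np.linalg.norm(right)
        up = np.cross(right, forward)
        c2w = np.eye(4)
        c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -forward, cam
        poses.append(c2w)
    return np.stack(poses)  # (n_views, 4, 4)
```

Rendering a synthetic object from these poses yields the orbit "video" whose frames the diffusion model learns to generate as a sequence.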
**Related Work:**
Previous work in 3D generation includes optimization-based methods and non-optimization paradigms. Video diffusion models have shown promise in generating dense multi-view images, but current methods often suffer from limitations such as a small number of generated views and high memory consumption. V3D addresses these issues by leveraging large-scale pre-trained video diffusion models and incorporating geometrical consistency priors.
**Approach:**
V3D treats dense multi-view synthesis as video generation, leveraging the structure and priors of pre-trained video diffusion models. For object-centric 3D generation, V3D fine-tunes the model on 360° orbit videos of synthetic 3D objects and designs a reconstruction pipeline tailored to the generated multi-view images. For scene-level novel view synthesis, V3D enhances the base model with a PixelNeRF encoder to control camera poses and accommodate multiple input images.
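One way to picture the scene-level conditioning is to render a PixelNeRF-style feature image at each target camera and fuse it with the corresponding noisy latent frame before the denoiser. The sketch below is a hedged illustration of that pattern in PyTorch, not V3D's actual architecture: `backbone` stands in for the pre-trained video diffusion U-Net, and `pixelnerf_feats`, `fuse`, and all channel counts are hypothetical.

```python
import torch
import torch.nn as nn

class PoseConditionedDenoiser(nn.Module):
    """Sketch: per-frame noisy latents are concatenated with feature images
    rendered by a PixelNeRF-style encoder at the target cameras, then
    projected back to the channel count the pre-trained backbone expects.
    Interfaces and channel counts are assumptions for illustration."""
    def __init__(self, backbone, latent_ch=4, feat_ch=16):
        super().__init__()
        self.backbone = backbone  # hypothetical video diffusion U-Net
        # 1x1 conv fusing [latent || rendered features] -> latent channels.
        self.fuse = nn.Conv2d(latent_ch + feat_ch, latent_ch, kernel_size=1)

    def forward(self, noisy_latents, pixelnerf_feats, timestep):
        # noisy_latents:   (B, T, C, H, W) latent video frames
        # pixelnerf_feats: (B, T, F, H, W) features rendered at target cameras
        b, t_frames, c, h, w = noisy_latents.shape
        x = torch.cat([noisy_latents, pixelnerf_feats], dim=2)
        x = self.fuse(x.flatten(0, 1)).view(b, t_frames, c, h, w)
        return self.backbone(x, timestep)
```

The design choice this illustrates is that camera control enters through per-view rendered features rather than global pose embeddings, which is what lets the model accommodate an arbitrary set of sparse input images.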
**Experiments:**
V3D is evaluated on both object-centric and scene-level 3D generation tasks, showing superior performance in terms of generation quality and multi-view consistency. User studies and quantitative comparisons further validate the effectiveness of V3D.
**Conclusion:**
V3D is a novel method for generating consistent multi-view images using video diffusion models. By fine-tuning pre-trained models and incorporating geometrical consistency priors, V3D achieves state-of-the-art performance in both object-centric and scene-level 3D generation.