18 Jul 2024 | Junlin Han, Filippos Kokkinos, and Philip Torr
This paper introduces VFusion3D, a method for building scalable 3D generative models from pre-trained video diffusion models. The central obstacle to developing foundation 3D generative models is the scarcity of 3D training data. To work around it, the authors repurpose EMU Video, a video diffusion model trained on large amounts of text, image, and video data, as a knowledge source for 3D: they fine-tune it to generate multi-view videos, then use it to synthesize a large-scale multi-view dataset on which VFusion3D is trained.

VFusion3D generates a high-quality 3D asset from a single image in seconds and outperforms current state-of-the-art feed-forward 3D generative models, winning over 90% of user-preference comparisons. The paper details the pipeline, training strategies, and ablation studies, and demonstrates VFusion3D on both single-image 3D reconstruction and text-to-3D generation, where it shows superior text alignment, image faithfulness, and visual quality; a user study confirms the quality and faithfulness of its outputs. The authors also examine scaling trends for 3D generative models and discuss the benefits and limitations of synthetic multi-view data relative to real 3D data.
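To make the three-stage pipeline concrete, here is a minimal Python sketch of the flow the summary describes: fine-tune a video diffusion model into a multi-view generator, synthesize a large multi-view dataset, then train a feed-forward image-to-3D network on it. This is not the authors' code; every class, method, and function name below (MultiViewSample, train_step, encode, render, and so on) is a hypothetical placeholder.

```python
# Hypothetical sketch of the VFusion3D pipeline; names are placeholders,
# not the paper's actual API.
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class MultiViewSample:
    views: List[Any]    # RGB frames orbiting one object
    cameras: List[Any]  # known camera pose for each frame


def finetune_multiview_generator(video_model: Any,
                                 rendered_orbits: List[MultiViewSample]) -> Any:
    # Stage 1: adapt a pre-trained video diffusion model (EMU Video in the
    # paper) on a comparatively small set of multi-view renders so that,
    # conditioned on a single image, it emits a camera-consistent orbit.
    for orbit in rendered_orbits:
        video_model.train_step(target=orbit.views, condition=orbit.views[0])
    return video_model


def generate_synthetic_dataset(video_model: Any,
                               seed_images: List[Any],
                               orbit_cameras: List[Any]) -> List[MultiViewSample]:
    # Stage 2: amplify scarce 3D data into a large synthetic multi-view
    # corpus by sampling the fine-tuned generator over many single images.
    return [MultiViewSample(views=video_model.sample(condition=img),
                            cameras=orbit_cameras)
            for img in seed_images]


def train_feedforward_3d(model: Any,
                         dataset: List[MultiViewSample],
                         loss_fn: Callable[[Any, Any], Any]) -> Any:
    # Stage 3: train the feed-forward image-to-3D network on synthetic
    # views; supervision renders the predicted 3D representation into each
    # camera and compares against the corresponding generated view.
    for sample in dataset:
        rep = model.encode(sample.views[0])  # single image in
        for view, cam in zip(sample.views, sample.cameras):
            model.step(loss_fn(model.render(rep, cam), view))
    return model


# Inference is a single forward pass, which is why generation takes seconds:
#   asset = trained_model.encode(one_image)  # no per-asset optimization loop
```

The design point the sketch highlights is that the expensive generative knowledge lives in the video model, while the final 3D model stays feed-forward, so inference avoids the per-asset optimization loops of score-distillation approaches.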