18 Jul 2024 | Junlin Han, Filippos Kokkinos, and Philip Torr
VFusion3D is a novel method for building scalable 3D generative models using pre-trained video diffusion models. The main challenge in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, text, or videos, 3D data are not readily accessible and are difficult to acquire, resulting in a significant disparity in scale compared to other data types. To address this, the authors propose using a video diffusion model, trained on extensive text, image, and video data, as a knowledge source for 3D data. By fine-tuning this model, they generate a large-scale synthetic multi-view dataset with which to train a feed-forward 3D generative model. The resulting model, VFusion3D, trained on nearly 3 million synthetic multi-view videos, can generate a 3D asset from a single image in seconds and outperforms current state-of-the-art feed-forward 3D generative models, with users preferring its results over 90% of the time.
The paper introduces a pipeline for VFusion3D that first uses a small amount of 3D data to fine-tune a video diffusion model, transforming it into a multi-view video generator. This generator then produces a large amount of synthetic multi-view data from web-scale data, which, after passing through a filtering system, yields a dataset of 3 million multi-view videos. VFusion3D is trained on this dataset to predict a 3D representation from a single image and to render novel views from it.
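As a rough illustration of this two-stage pipeline, the sketch below uses tiny stub models and toy tensors. Every class and function name here (StubMultiViewGenerator, StubFeedForward3D, passes_filter, and so on) is a hypothetical placeholder, not the authors' implementation; it only mirrors the structure described above under assumed interfaces.

```python
# Minimal, hypothetical sketch of the two-stage pipeline summarised above.
# The stub models and all names are illustrative only, not the paper's code.
import torch
import torch.nn as nn

N_VIEWS, H, W = 8, 32, 32  # toy resolution; the paper uses full-size renders

class StubMultiViewGenerator(nn.Module):
    """Stands in for the fine-tuned video diffusion model: given a conditioning
    image, it 'samples' a multi-view video (here just random frames)."""
    def sample(self, cond_image):
        return torch.rand(N_VIEWS, 3, H, W)

class StubFeedForward3D(nn.Module):
    """Stands in for VFusion3D: predicts a latent 3D representation from one
    image and renders all views from it (here a trivial linear decoder)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(3 * H * W, 64)
        self.decoder = nn.Linear(64, N_VIEWS * 3 * H * W)
    def forward(self, image):
        z = self.encoder(image.flatten())              # "3D representation"
        return self.decoder(z).view(N_VIEWS, 3, H, W)  # rendered novel views

# Stage 1 (not shown): fine-tune the video diffusion model on renders of a
# small 3D dataset so that its frames follow a fixed camera orbit.
generator = StubMultiViewGenerator()

# Stage 2a: generate synthetic multi-view videos at scale and filter them.
def passes_filter(views):  # placeholder for the paper's filtering system
    return views.std() > 0.1

synthetic_data = [v for v in (generator.sample(torch.rand(3, H, W)) for _ in range(16))
                  if passes_filter(v)]

# Stage 2b: train the feed-forward model to reproduce all synthetic views
# from the first view alone.
model = StubFeedForward3D()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for views in synthetic_data:
    rendered = model(views[0])
    loss = nn.functional.mse_loss(rendered, views)
    opt.zero_grad()
    loss.backward()
    opt.step()
```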
The authors propose several strategies to stabilize training and improve robustness, including a multi-stage training recipe, image-level supervision in place of pixel-level supervision, an opacity loss, and camera noise injection. The model is evaluated against several distillation-based and feed-forward 3D generative models using both a user study and automated metrics, and the results show that VFusion3D outperforms previous works in generation quality and image faithfulness.
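The summary does not spell out the exact objective, so the snippet below is only a guess at how these strategies might combine: a perceptual loss (here the off-the-shelf lpips package) for image-level supervision, a binary cross-entropy opacity loss on rendered alpha masks, and Gaussian noise injected into the camera parameters. The loss weights, noise scale, and the model interface are all assumptions.

```python
# Hypothetical sketch of a training objective combining the listed strategies.
# Loss choices, weights, and the model's (rgb, alpha) interface are assumed,
# not taken from the paper.
import torch
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net='vgg')  # image-level loss instead of raw pixel MSE

def training_loss(model, input_image, target_views, target_masks, cameras,
                  w_opacity=0.5, cam_noise_std=0.01):
    # Camera noise injection: perturb the (imperfect) synthetic camera poses so
    # the model becomes robust to pose errors in the generated multi-view data.
    noisy_cameras = cameras + cam_noise_std * torch.randn_like(cameras)

    # Assumed interface: the model renders RGB views and opacity (alpha) masks.
    rendered_rgb, rendered_alpha = model(input_image, noisy_cameras)

    # Image-level supervision: LPIPS expects 4D image batches scaled to [-1, 1].
    img_loss = perceptual(rendered_rgb * 2 - 1, target_views * 2 - 1).mean()

    # Opacity loss: push rendered opacity toward the foreground mask so the
    # object stays solid and the background stays empty.
    opacity_loss = torch.nn.functional.binary_cross_entropy(
        rendered_alpha.clamp(1e-5, 1 - 1e-5), target_masks)

    return img_loss + w_opacity * opacity_loss
```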
The paper also examines the scalability of 3D generative models and the benefits of synthetic multi-view data. The experiments show that real 3D data teaches the model to reconstruct common objects more efficiently, while large-scale synthetic multi-view data enables generalization to unusual objects and scenes; combining both yields the best performance.
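One simple way to act on this finding is to mix the two data sources in a single training stream. The generator below is purely illustrative; the 50/50 sampling probability is an assumption, since the summary only states that combining the sources works best.

```python
# Illustrative only: interleave real 3D renders with synthetic multi-view
# videos during training. The sampling ratio is an assumed hyperparameter.
import random

def mixed_batches(real_3d_renders, synthetic_multiview, p_real=0.5):
    """Yield training samples drawn from both pools.

    real_3d_renders:      views rendered from ground-truth 3D assets
    synthetic_multiview:  views generated by the fine-tuned video model
    """
    while True:
        pool = real_3d_renders if random.random() < p_real else synthetic_multiview
        yield random.choice(pool)
```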
The authors conclude that their approach leverages a video diffusion model as a multi-view data generator, facilitating the learning of scalable 3D generative models. VFusion3D, which learns from synthetic data, shows superior performance in generating 3D assets. The model is highly scalable, improving with both the amount of synthetic multi-view data and the amount of 3D data, and paves new paths for 3D generative models.