Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models

26 May 2024 | Hanwen Liang¹, Yuyang Yin², Dejia Xu³, Hanxue Liang⁴, Zhangyang Wang³, Konstantinos N. Plataniotis¹, Yao Zhao², Yunchao Wei²†
Diffusion4D is a novel framework for efficient, spatial-temporally consistent 4D generation built on video diffusion models. It addresses the challenge of producing high-quality 4D content by extending the temporal consistency of video diffusion models to the spatial-temporal consistency that 4D generation requires. To this end, a large-scale, high-quality 4D dataset is curated from existing 3D datasets and used to train a 4D-aware video diffusion model that generates orbital views of dynamic 3D assets.

Several components shape the learned dynamics: a 3D-to-4D motion magnitude metric controls the dynamic strength of the assets, a motion magnitude reconstruction loss refines the learning of 3D-to-4D dynamics, and a 3D-aware classifier-free guidance further enhances the dynamics of the generated 3D assets. From the synthesized multi-view consistent images, the framework then performs explicit 4D construction with Gaussian splatting in a coarse-to-fine manner.

The method outperforms prior state-of-the-art techniques in generation efficiency and 4D geometry consistency across various prompt modalities, and it can produce high-fidelity, diverse 4D assets within several minutes. Evaluations on text-conditioned and image-conditioned 4D generation show superior performance in 3D geometry consistency, appearance quality, motion fidelity, and text alignment. An ablation study confirms that each component contributes to spatial-temporal consistency, with the full model delivering the best results both quantitatively and qualitatively.
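To make the ideas above more concrete, here is a minimal Python sketch of two of the summarized components: a simple motion magnitude proxy over rendered frames, and a multi-condition classifier-free guidance step in the spirit of the "3D-aware" guidance described. The function names, the `model` callable, and the `cond_img`/`cond_views` conditioning inputs are illustrative assumptions, not the authors' actual API or exact formulation.

```python
import torch


def motion_magnitude(frames: torch.Tensor) -> torch.Tensor:
    """Hypothetical 3D-to-4D motion magnitude proxy.

    `frames` has shape (T, C, H, W): orbital-view renders of a dynamic
    asset over time. Here the metric is simply the mean absolute change
    between consecutive frames; the paper's exact definition may differ.
    """
    return (frames[1:] - frames[:-1]).abs().mean()


def guided_eps(model, x_t, t, cond_img, cond_views, w_img=7.5, w_3d=2.0):
    """Minimal multi-condition classifier-free guidance step (sketch).

    `cond_img` stands in for the appearance prompt (image/text embedding)
    and `cond_views` for the orbital camera / 3D conditioning; both are
    placeholder names used only for illustration.
    """
    eps_uncond = model(x_t, t, cond_img=None, cond_views=None)          # all conditions dropped
    eps_img = model(x_t, t, cond_img=cond_img, cond_views=None)         # appearance condition only
    eps_full = model(x_t, t, cond_img=cond_img, cond_views=cond_views)  # appearance + 3D views

    # Two guidance terms: the first pulls the prediction toward the
    # appearance prompt, the second toward the 3D (multi-view)
    # conditioning that encourages consistent, stronger dynamics.
    return eps_uncond + w_img * (eps_img - eps_uncond) + w_3d * (eps_full - eps_img)
```

This two-term composition mirrors common practice for diffusion models with multiple conditioning signals, where each guidance weight scales adherence to one condition independently.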