4Diffusion is a novel pipeline for generating high-quality, spatial-temporally consistent 4D content from a monocular video. At its core is a multi-view video diffusion model, 4DM, built upon the pre-trained 3D-aware diffusion model ImageDream and augmented with a learnable motion module that captures multi-view spatial-temporal correlations. This allows the model to generate consistent multi-view videos and to provide guidance for 4D generation. The model is trained on a curated dataset of 926 high-quality animated 3D shapes, enabling it to learn reasonable temporal dynamics while preserving the characteristics of the original ImageDream model.
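The idea of a learnable motion module can be pictured as a small temporal layer slotted into a frozen spatial diffusion backbone. The sketch below is illustrative only: the class name, tensor layout, and zero-initialized residual projection are assumptions about how such a module is commonly implemented, not the paper's actual code.

```python
import torch
import torch.nn as nn

class TemporalMotionModule(nn.Module):
    """Hypothetical motion module: self-attention across the frame axis,
    added after a frozen spatial block of the diffusion UNet."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj_out = nn.Linear(channels, channels)
        # Zero-init so the module starts as an identity mapping and the
        # pre-trained (frozen) spatial layers are preserved at initialization.
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * views, frames, channels, height, width)
        bv, f, c, h, w = x.shape
        # Attend over frames independently for every view and spatial location.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(bv * h * w, f, c)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + self.proj_out(attn_out)  # residual connection
        return tokens.reshape(bv, h, w, f, c).permute(0, 3, 4, 1, 2)

# Usage sketch: only the motion modules are trained; the spatial UNet stays frozen.
if __name__ == "__main__":
    motion = TemporalMotionModule(channels=320)
    feats = torch.randn(4 * 4, 8, 320, 16, 16)  # (batch*views, frames, C, H, W)
    print(motion(feats).shape)                   # torch.Size([16, 8, 320, 16, 16])
```

Zero-initializing the output projection is one common way to ensure the added temporal layers do not disturb the pre-trained model's behavior at the start of fine-tuning.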
To optimize the 4D representation, 4Diffusion employs a 4D-aware Score Distillation Sampling (SDS) loss, which distills prior knowledge from 4DM into a dynamic NeRF. An anchor loss is additionally introduced to enhance appearance details and facilitate the learning of the dynamic NeRF. Together, these losses keep the generated 4D content spatial-temporally consistent and visually appealing.
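As a point of reference, a 4D-aware SDS objective of this kind is typically built on the standard SDS gradient, applied to multi-view video frames rendered from the dynamic NeRF. The formulation below is a sketch under assumed notation (renderer $g$, weighting $w(t)$, noise prediction $\hat{\epsilon}_\phi$ from 4DM, anchor weight $\lambda$); the anchor term stands in for the paper's appearance-refinement loss without specifying its exact form.

```latex
% Sketch of a 4D-aware SDS objective (notation assumed, not the paper's exact formulation).
% g(\theta; c, \tau): rendering of the dynamic NeRF \theta from camera c at time \tau,
% \hat{\epsilon}_\phi: noise predicted by the multi-view video diffusion model 4DM.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon,c,\tau}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y, c, \tau, t) - \epsilon\bigr)\,
      \frac{\partial g(\theta; c, \tau)}{\partial \theta}
    \right],
\qquad
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{SDS}} + \lambda\,\mathcal{L}_{\mathrm{anchor}}
```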
Extensive qualitative and quantitative experiments demonstrate that 4Diffusion outperforms existing methods at generating 4D content from monocular videos, achieving superior spatial consistency, temporal coherence, and visual quality. The results show that 4Diffusion produces high-quality 4D content with coherent, vivid motion, making it suitable for applications such as digital humans, gaming, media, and AR/VR. Its ability to capture multi-view spatial-temporal correlations and its efficient training process make it a promising approach for 4D generation.