4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

20 Nov 2024 | Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, László A Jeni, Sergey Tulyakov, Hsin-Ying Lee
4Real is a novel framework for generating near-photorealistic 4D scenes from text prompts. It models dynamic scenes with deformable 3D Gaussian Splats (D-3DGS). The pipeline begins by generating a reference video with a video diffusion model. A freeze-time video is then derived from the reference video and used to learn the canonical 3D representation, with per-frame deformations learned alongside it to absorb inconsistencies across the freeze-time frames. Temporal deformations of the canonical representation are then learned to capture the scene's dynamics. The result is a dynamic scene with enhanced photorealism and structural integrity that can be viewed from multiple perspectives; a code sketch of this representation follows the summary below.

4Real outperforms existing methods in photorealism, diversity, and computational efficiency by leveraging video diffusion models trained on large-scale real-world data, avoiding reliance on multi-view generative models. Evaluated on a set of text prompts against state-of-the-art object-centric 4D generation methods, it demonstrates superior motion realism, foreground/background realism, 3D shape realism, and video-text alignment. Compared to a baseline that combines 3D scene generation with 4D object-centric generation, 4Real produces more natural object placement, motion, and lighting. It is also computationally efficient, taking about 1.5 hours on an A100 GPU versus over 10 hours for competing methods, and it shows significant improvements on quantitative metrics such as X-CLIP and VideoScore.

The approach is limited by the underlying video generation model, including its resolution, blurriness, and artifacts during fast motion, as well as by challenges in reconstructing dynamic content and producing high-quality geometry such as meshes. Future work aims to incorporate stronger video generation models with more accurate camera-pose and object-motion control, along with cross-frame attention and feedforward 3D reconstruction.
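To make the pipeline description above concrete, here is a minimal, hedged PyTorch sketch of a deformable 3D Gaussian Splat model in the spirit of 4Real's representation. The class name `DeformableGaussians`, the MLP sizes, the frame-embedding scheme, and the positions-only deformation are illustrative assumptions, not the authors' exact architecture; a real implementation would pair this with a differentiable Gaussian splatting renderer and photometric losses against the generated videos.

```python
from typing import Optional

import torch
import torch.nn as nn


class DeformableGaussians(nn.Module):
    """Illustrative deformable 3D Gaussian Splat (D-3DGS) sketch.

    A canonical set of Gaussians is fit from the freeze-time video.
    Two small MLPs then predict:
      * per-frame deformations that absorb residual inconsistencies
        between freeze-time frames while fitting the canonical scene, and
      * temporal deformations of the canonical scene that capture the
        dynamics observed in the reference video.
    Only positions are deformed here for brevity; a full model also stores
    (and may deform) scales, rotations, opacities, and colors.
    """

    def __init__(self, num_gaussians: int = 50_000, num_frames: int = 16,
                 hidden: int = 128):
        super().__init__()
        # Canonical Gaussian centers, optimized from the freeze-time video.
        self.canonical_xyz = nn.Parameter(0.1 * torch.randn(num_gaussians, 3))

        # Per-frame deformation: a learned embedding per freeze-time frame,
        # decoded together with position into a small positional offset.
        self.frame_embed = nn.Embedding(num_frames, 32)
        self.frame_deform = nn.Sequential(
            nn.Linear(3 + 32, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

        # Temporal deformation: conditioned on normalized time t in [0, 1],
        # moves the canonical Gaussians to animate the scene.
        self.time_deform = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, t: Optional[torch.Tensor] = None,
                frame_idx: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Return deformed Gaussian centers for splatting by a renderer."""
        xyz = self.canonical_xyz
        if frame_idx is not None:
            # Canonical fit: per-frame correction offsets for the
            # (slightly inconsistent) freeze-time frames.
            emb = self.frame_embed(frame_idx).expand(xyz.shape[0], -1)
            xyz = xyz + self.frame_deform(torch.cat([xyz, emb], dim=-1))
        if t is not None:
            # Dynamics: temporal deformation driven by the reference video.
            t_col = t.expand(xyz.shape[0], 1)
            xyz = xyz + self.time_deform(torch.cat([xyz, t_col], dim=-1))
        return xyz


if __name__ == "__main__":
    model = DeformableGaussians(num_gaussians=10_000)
    # Freeze-time fit: query frame 3 of the freeze-time video (no motion).
    xyz_frozen = model(frame_idx=torch.tensor(3))
    # Dynamic rendering: query the canonical scene at normalized time 0.25.
    xyz_dynamic = model(t=torch.tensor(0.25))
    print(xyz_frozen.shape, xyz_dynamic.shape)  # torch.Size([10000, 3]) each
```

In a pipeline like the one summarized above, such a representation would be optimized in stages: first against the freeze-time video (with `frame_idx` active) to obtain the canonical scene, then against the reference video (with `t` active) to recover the temporal deformation. The splatting renderer and loss terms are omitted from this sketch.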
[slides and audio] 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models