CONTROLLING SPACE AND TIME WITH DIFFUSION MODELS

20 Apr 2025 | Daniel Watson*, Saurabh Saxena*, Lala Li*, Andrea Tagliasacchi†, David J. Fleet†
4DiM is a cascaded diffusion model for 4D novel view synthesis (NVS): conditioned on one or more images of a natural scene, it generates novel views at arbitrary camera trajectories and timestamps. The model supports training on a mixture of 3D data (with camera pose), 4D data (pose + time), and video data (time but no pose), covering both indoor and outdoor scenes, which improves generalization to unseen images and camera pose trajectories. 4DiM is the first NVS method with intuitive, metric-scale camera pose control, enabled by a novel calibration pipeline for structure-from-motion-posed data.

To handle the incomplete conditioning signals that arise from this heterogeneous data mixture, 4DiM uses Masked FiLM layers, and it applies multi-guidance to improve sample quality. A scale-calibrated version of RealEstate10K is created to improve model fidelity and enable metric pose control.

Experiments show that 4DiM outperforms prior 3D NVS models in image quality (FID, FDD, FVD) and reconstruction-based metrics (PSNR, SSIM, LPIPS), and demonstrates better dynamics and pose alignment, especially in zero-shot settings. Ablations show significant gains in fidelity and generalization from training on the calibrated data, and further gains from co-training with video data. The model provides a general framework for tasks such as single-image-to-3D, two-image-to-video (interpolation and extrapolation), pose-conditioned video-to-video translation, and panorama stitching.

4DiM is the first model capable of generating multiple, approximately consistent views under simultaneous camera and time control from as few as a single input image, achieving state-of-the-art pose alignment and much better generalization compared to prior work on 3D NVS.
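To illustrate how a Masked FiLM layer can tolerate missing conditioning (e.g. video frames without camera pose), here is a minimal sketch in PyTorch. The class name, shapes, and reduction-to-identity behavior are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class MaskedFiLM(nn.Module):
    """Sketch of a FiLM layer gated by a per-example conditioning mask.

    FiLM (feature-wise linear modulation) predicts a scale (gamma) and
    shift (beta) per channel from a conditioning vector. Here, when the
    conditioning signal is absent (mask = 0), the modulation is zeroed so
    the layer reduces to the identity, letting a single model train on
    mixed 3D / 4D / unposed-video data. (Illustrative assumption, not
    the paper's exact code.)
    """

    def __init__(self, cond_dim: int, num_features: int):
        super().__init__()
        # Predict per-channel gamma and beta from the conditioning vector.
        self.proj = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, x: torch.Tensor, cond: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); cond: (B, cond_dim); mask: (B,), 1 if cond present.
        gamma, beta = self.proj(cond).chunk(2, dim=-1)   # each (B, C)
        gamma = gamma.view(*gamma.shape, 1, 1)           # (B, C, 1, 1)
        beta = beta.view(*beta.shape, 1, 1)              # (B, C, 1, 1)
        m = mask.view(-1, 1, 1, 1)                       # (B, 1, 1, 1)
        # Masked modulation: identity wherever conditioning is missing.
        return x * (1 + m * gamma) + m * beta
```

With `mask = 0` the output equals the input exactly, so unposed video batches flow through the same network without spurious pose modulation.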
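Multi-guidance composes classifier-free guidance over more than one conditioning signal. One common factorization (an assumption here; the paper's exact weights and factorization may differ) applies a separate guidance weight to each signal in turn:

```python
import numpy as np

def multi_guided_eps(eps_uncond, eps_img, eps_img_pose,
                     w_img=3.0, w_pose=3.0):
    """Sketch of multi-guidance over two conditioning signals.

    eps_uncond:   denoiser output with all conditioning dropped
    eps_img:      output conditioned on the input image(s) only
    eps_img_pose: output conditioned on image(s) and camera pose
    w_img, w_pose: per-signal guidance weights (illustrative values)

    This factorized composition of classifier-free guidance is an
    assumption for illustration, not the paper's exact formulation.
    """
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_pose * (eps_img_pose - eps_img))
```

With both weights set to 1 the expression collapses to the fully conditioned prediction; raising either weight amplifies the influence of that conditioning signal independently.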