20 Apr 2025 | Daniel Watson*, Saurabh Saxena*, Lala Li*, Andrea Tagliasacchi†, David J. Fleet†
The paper introduces 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS) that supports generation with arbitrary camera trajectories and timestamps, conditioned on one or more images. 4DiM is trained on a mixture of 3D (with camera pose), 4D (pose+time), and video (time but no pose) data, enabling it to generalize better to unseen images and camera poses compared to prior works. The model's architecture includes FiLM layers and multi-guidance to handle incomplete conditioning signals, and it uses calibrated datasets to improve metric-scale pose control. Experiments demonstrate that 4DiM outperforms prior 3D NVS models in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. The paper evaluates 4DiM on various metrics, including FID, FDD, FVD, PSNR, SSIM, LPIPS, TSED, SfM distances, and keypoint distance, and shows its effectiveness in tasks such as panorama stitching and video-to-video translation.
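To make the role of FiLM layers under incomplete conditioning more concrete, the following is a minimal sketch of feature-wise linear modulation that falls back to the identity when a conditioning signal is absent, as with video data that has timestamps but no camera pose. It is an illustration only, not the 4DiM implementation: the function name `film`, the use of `None` to mark a missing signal, and passing the scale and shift in directly (rather than predicting them from the conditioning with a learned layer) are all assumptions made for this example.

```python
from typing import List, Optional, Sequence


def film(
    features: Sequence[float],
    gamma: Optional[Sequence[float]],
    beta: Optional[Sequence[float]],
) -> List[float]:
    """Generic FiLM: out[i] = gamma[i] * features[i] + beta[i].

    Assumption for this sketch: a missing conditioning signal is passed
    as None, and the layer then acts as the identity (gamma = 1, beta = 0).
    """
    if gamma is None or beta is None:
        # No conditioning available: leave the features unchanged.
        return list(features)
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


# Hypothetical usage: one sample with conditioning, one without.
with_pose = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
without_pose = film([1.0, 2.0], gamma=None, beta=None)
```

In a real network the scale and shift would be produced per channel from the pose and time embeddings. The point of the sketch is only that a missing signal can be made to leave the features untouched, which is one way a single model could train on a mixture of 3D, 4D, and video data.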