SV4D is a latent video diffusion model designed for multi-frame and multi-view consistent dynamic 3D content generation. Unlike previous methods that use separate models for video generation and novel-view synthesis, SV4D employs a single unified diffusion model, combining video and multi-view diffusion, to generate novel-view videos of dynamic 3D objects. Given a monocular reference video, SV4D generates novel views for each frame that are temporally consistent, then uses these videos to efficiently optimize an implicit 4D representation (a dynamic NeRF) without SDS-based optimization. The model is trained on a curated dynamic 3D object dataset derived from Objaverse.

SV4D's key contributions are a unified network that reasons jointly across the frame and view axes, a mixed sampling scheme for sequentially processing long videos, and a spatio-temporal CFG scaling strategy that improves the quality of the generated videos.

Evaluated on synthetic and real-world benchmark datasets, SV4D achieves state-of-the-art performance in novel-view video synthesis and 4D generation, with superior visual quality, video-frame consistency, and multi-view consistency compared to prior works. User studies further confirm that SV4D produces more stable and realistic results than existing methods.
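To make the first contribution, joint reasoning across the frame and view axes, more concrete, here is a minimal sketch of a latent block that alternates self-attention along the two axes. The module name `FrameViewAttention`, the tensor layout `(batch, views, frames, tokens, dim)`, and all hyperparameters are illustrative assumptions, not SV4D's actual implementation.

```python
import torch
import torch.nn as nn

class FrameViewAttention(nn.Module):
    """Illustrative sketch (not SV4D's actual code): self-attention applied
    alternately along the frame axis and the view axis of a latent grid."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, frames, tokens, dim) latent grid
        b, v, f, t, d = x.shape

        # Temporal attention: each view attends across its own frames.
        h = x.permute(0, 1, 3, 2, 4).reshape(b * v * t, f, d)
        n = self.norm1(h)
        h = h + self.frame_attn(n, n, n)[0]
        x = h.reshape(b, v, t, f, d).permute(0, 1, 3, 2, 4)

        # View attention: each frame attends across all views.
        h = x.permute(0, 2, 3, 1, 4).reshape(b * f * t, v, d)
        n = self.norm2(h)
        h = h + self.view_attn(n, n, n)[0]
        return h.reshape(b, f, t, v, d).permute(0, 3, 1, 2, 4)
```

Factoring attention this way keeps the cost linear in the product of views and frames rather than attending over the full joint grid at once.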
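The mixed sampling scheme for long videos can be pictured as first generating a sparse set of anchor frames spanning the whole clip, then filling the remaining frames chunk by chunk, conditioned on the already-generated anchors. The sketch below only illustrates this scheduling logic; the callback `sample_views`, its conditioning interface, and the stride parameter are hypothetical.

```python
from typing import Callable, Dict, List

def mixed_sampling(
    num_frames: int,
    anchor_stride: int,
    sample_views: Callable[[List[int], Dict[int, object]], Dict[int, object]],
) -> Dict[int, object]:
    """Illustrative scheduling sketch (not SV4D's actual code).

    Pass 1 samples multi-view images for sparse anchor frames spanning the
    video; pass 2 samples the remaining frames in chunks, conditioning each
    chunk on the anchors so long videos stay temporally consistent.
    """
    # Pass 1: sparse anchors covering the whole clip.
    anchors = list(range(0, num_frames, anchor_stride))
    generated = sample_views(anchors, {})

    # Pass 2: dense generation between consecutive anchors.
    for start, end in zip(anchors, anchors[1:] + [num_frames]):
        chunk = [i for i in range(start + 1, end) if i not in generated]
        if chunk:
            generated.update(sample_views(chunk, generated))
    return generated
```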
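Classifier-free guidance with a single global scale can over- or under-guide views and frames that are far from the reference; a spatio-temporal scaling varies the guidance weight across the view and frame axes instead. The linear ramp below is an assumed schedule for illustration, not the exact scaling SV4D uses.

```python
import torch

def spatiotemporal_cfg(
    cond: torch.Tensor,    # (views, frames, C, H, W) conditional prediction
    uncond: torch.Tensor,  # same shape, unconditional prediction
    min_scale: float = 1.0,
    max_scale: float = 3.0,
) -> torch.Tensor:
    """Illustrative sketch (not SV4D's exact schedule): a CFG scale that
    grows linearly along both the view and frame axes, so cells farther
    from the reference view/frame receive stronger guidance."""
    v, f = cond.shape[:2]
    view_ramp = torch.linspace(min_scale, max_scale, v)
    frame_ramp = torch.linspace(min_scale, max_scale, f)
    # Outer combination -> one guidance scale per (view, frame) cell.
    scale = 0.5 * (view_ramp[:, None] + frame_ramp[None, :])
    scale = scale.view(v, f, 1, 1, 1).to(cond)
    return uncond + scale * (cond - uncond)
```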
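Finally, because SV4D already produces consistent multi-view videos, the 4D representation can be fit with a plain photometric reconstruction loss rather than SDS, which removes the diffusion model from the optimization loop entirely. A minimal sketch of such a loop follows; the `model.render` API and `dataset.sample_batch` layout are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def fit_4d(model, optimizer, dataset, steps: int = 10_000):
    """Illustrative sketch (not SV4D's training code): optimize a dynamic
    NeRF against the generated multi-view videos with a simple photometric
    loss -- no SDS, so each step is a cheap rendering + MSE update."""
    for step in range(steps):
        # Each sample: camera pose, timestamp, and a generated target image.
        camera, t, target = dataset.sample_batch()
        rendered = model.render(camera, t)  # assumed renderer interface
        loss = F.mse_loss(rendered, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```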