The paper introduces 4Diffusion, a 4D generation pipeline that produces spatial-temporally consistent 4D content from a monocular video. It targets two difficulties common to current 4D generation methods: multi-view spatial-temporal modeling and reconciling the diverse prior knowledge of multiple diffusion models. To this end, 4Diffusion proposes a unified diffusion model, 4DM, which inserts a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations. Trained on a curated dataset of high-quality multi-view videos, 4DM achieves reasonable temporal consistency while preserving the generalizability and spatial consistency of the underlying 3D-aware diffusion model. The paper further introduces a 4D-aware Score Distillation Sampling (SDS) loss to optimize a 4D representation parameterized as a dynamic NeRF, avoiding the discrepancies that arise when distilling from multiple diffusion models. An anchor loss is also devised to enhance appearance details and facilitate learning of the dynamic NeRF. Extensive qualitative and quantitative experiments show that 4Diffusion outperforms prior methods, generating high-quality 4D content with consistent spatial-temporal appearance and coherent, vivid motion.
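To make the optimization step concrete, the sketch below is a minimal, hypothetical PyTorch rendering of a 4D-aware SDS update under standard SDS weighting. The function and variable names (`sds_loss_4d`, the `diffusion` callable standing in for 4DM, `alphas_cumprod`) are assumptions for illustration, not the authors' implementation; the point is that gradients from a frozen multi-view video diffusion model are distilled into the dynamic NeRF jointly across camera views and frames.

```python
# Hypothetical sketch of a 4D-aware SDS update, assuming a generic frozen
# multi-view video diffusion model in place of 4DM. Not the authors' code.
import torch

def sds_loss_4d(diffusion, rendered, text_emb, alphas_cumprod):
    """One SDS step on NeRF renderings of shape (V, T, C, H, W):
    V camera views, T frames. Gradients flow into the dynamic NeRF
    through `rendered`; the diffusion model itself stays frozen."""
    V, T, C, H, W = rendered.shape
    t = torch.randint(20, 980, (1,))                            # random diffusion timestep
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t].view(1, 1, 1, 1, 1)
    noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise    # forward diffusion
    with torch.no_grad():                                       # frozen score network
        eps_pred = diffusion(noisy, t, text_emb)                # predicted noise
    w = 1 - a_t                                                 # a common SDS weighting
    grad = w * (eps_pred - noise)
    # Surrogate loss whose gradient w.r.t. `rendered` equals `grad`.
    return (grad.detach() * rendered).sum() / (V * T)

if __name__ == "__main__":
    # Smoke test with a dummy "diffusion model" that ignores conditioning.
    dummy = lambda x, t, emb: torch.randn_like(x)
    renders = torch.randn(4, 8, 3, 64, 64, requires_grad=True)  # 4 views, 8 frames
    alphas = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)
    loss = sds_loss_4d(dummy, renders, None, alphas)
    loss.backward()
    print(renders.grad.shape)                                   # matches the renderings
```

In an actual pipeline, `renders` would come from the dynamic NeRF at sampled camera poses and timestamps, and the paper's anchor loss would be added on top of this SDS term to sharpen appearance details.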