MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences

23 Jul 2024 | Canyu Zhao, Mingyu Liu, Wen Wang, Jianlong Yuan, Hao Chen, Bo Zhang, Chunhua Shen
MovieDreamer is a novel hierarchical framework that combines autoregressive models with diffusion-based rendering to generate long-duration videos with complex narratives and high visual fidelity. The method uses autoregressive models to ensure global narrative coherence by predicting sequences of visual tokens, which are then transformed into high-quality video frames through diffusion rendering. This approach is inspired by traditional movie production processes, where complex stories are broken down into manageable scenes.

The framework also employs a multimodal script that includes detailed character information and visual style, enhancing continuity and character identity across scenes. The method uses a diffusion autoencoder to tokenize keyframes and an autoregressive model to predict visual tokens, which are then decoded and rendered into video sequences. It also incorporates ID-preserving rendering to maintain character identities and supports both zero-shot and few-shot generation scenarios. Extensive experiments across various movie genres demonstrate that MovieDreamer achieves superior visual and narrative quality while significantly extending the duration of generated content.
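The pipeline described above (diffusion autoencoder tokenizes keyframes, an autoregressive model predicts the next visual tokens conditioned on the multimodal script, and a diffusion renderer expands decoded keyframes into video) can be sketched at a high level as follows. This is a minimal, hypothetical sketch, not the authors' implementation: every class and function here (`Script`, `tokenize_keyframe`, `predict_next_tokens`, `render_clip`, `generate_long_video`) is an illustrative stand-in for the corresponding learned component.

```python
# Hypothetical sketch of the hierarchical generation loop; all components are
# toy stand-ins for the learned models described in the paper.
from dataclasses import dataclass
from typing import List


@dataclass
class Script:
    """Multimodal script entry: scene text plus character/style descriptors."""
    scene_text: str
    characters: List[str]
    style: str


def tokenize_keyframe(frame: List[float]) -> List[int]:
    """Stand-in for the diffusion autoencoder's encoder (toy quantization)."""
    return [int(v * 10) for v in frame]


def predict_next_tokens(history: List[List[int]], script: Script) -> List[int]:
    """Stand-in for the autoregressive model: conditions (trivially) on the
    script and on all previously generated keyframe tokens."""
    last = history[-1] if history else [0, 0, 0]
    return [t + len(script.characters) for t in last]


def render_clip(tokens: List[int], num_frames: int = 4) -> List[List[float]]:
    """Stand-in for diffusion rendering: decode tokens, expand into frames."""
    decoded = [t / 10.0 for t in tokens]
    return [list(decoded) for _ in range(num_frames)]


def generate_long_video(scripts: List[Script],
                        seed_frame: List[float]) -> List[List[float]]:
    """Scene-by-scene generation: global coherence lives in the token
    history; local fidelity lives in the per-scene renderer."""
    history = [tokenize_keyframe(seed_frame)]
    video: List[List[float]] = []
    for script in scripts:
        tokens = predict_next_tokens(history, script)
        history.append(tokens)             # global narrative state
        video.extend(render_clip(tokens))  # local high-fidelity frames
    return video


scripts = [
    Script("A chase through the rain", ["hero"], "noir"),
    Script("The reunion", ["hero", "ally"], "noir"),
]
video = generate_long_video(scripts, seed_frame=[0.1, 0.2, 0.3])
print(len(video))  # 2 scenes x 4 frames each -> 8
```

The key structural point the sketch conveys is the hierarchy: the autoregressive step sees only compact keyframe tokens (cheap, long-range), while the expensive diffusion rendering is confined to short per-scene clips.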