MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences


23 Jul 2024 | Canyu Zhao, Mingyu Liu, Wen Wang, Jianlong Yuan, Hao Chen, Bo Zhang, Chunhua Shen
**Abstract.** Recent advances in video generation have focused mainly on short-duration content produced with diffusion models. These approaches struggle with complex narratives and character consistency over extended periods, both of which are crucial for long-form video production such as movies. To address this, the authors propose MovieDreamer, a hierarchical framework that integrates autoregressive models with diffusion-based rendering to generate long-duration videos with intricate plot progression and high visual fidelity.

**Key Contributions:**
1. **Hierarchical Framework:** MovieDreamer combines autoregressive models for global narrative coherence with diffusion rendering for high visual fidelity.
2. **Multimodal Script:** A structured multimodal script enriches scene descriptions with detailed character information and visual style, improving continuity and character identity across scenes.
3. **Identity-Preserving Rendering:** An identity-preserving diffusion decoder mitigates errors in vision-token prediction and improves identity preservation in the generated videos.

**Methodology:**
- **Diffusion Autoencoder:** Tokenizes keyframes into compact visual tokens.
- **Autoregressive Keyframe Token Generation:** A multimodal autoregressive model predicts the next keyframe's visual tokens from the multimodal script and the token history.
- **Anti-Overfitting Strategies:** Applies data augmentation, face-embedding randomization, aggressive dropout, and token masking to combat overfitting.
- **Multimodal Scripts:** Structures the script with rich descriptions of scenes and character identities, improving narrative continuity and character control.
- **ID-Preserving Rendering:** Enhances the decoder to better preserve character identities, particularly facial features.
- **Keyframe-Based Video Generation:** Uses the last frame of the previously generated clip as an anchor to strengthen the model's awareness of the original image distribution.

**Experiments:**
- **Dataset and Evaluation Metrics:** Evaluated on a test set of 10 long movies using CLIP score, Aesthetic Score (AS), Fréchet Inception Distance (FID), and Inception Score (IS).
- **Comparison with the State of the Art:** MovieDreamer generates long videos with superior visual and narrative quality, maintaining both short-term and long-term consistency.

**Conclusion:** MovieDreamer addresses the challenge of generating long-duration visual content with complex narratives by combining the strengths of autoregression and diffusion. Its multimodal script preserves character consistency across scenes, and its ID-preserving rendering further strengthens character identity. This work opens up exciting possibilities for automated long-duration video production.
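The hierarchical loop described in the methodology (tokenize keyframes, autoregressively predict the next keyframe's tokens from script plus history, then render with a diffusion decoder) can be sketched as toy code. This is a minimal illustration under assumed names and shapes (`TOKENS_PER_KEYFRAME`, `TOKEN_DIM`, the stand-in functions), not the authors' implementation: the real system uses a learned diffusion autoencoder, a multimodal transformer, and an identity-preserving diffusion decoder in place of the toy functions below.

```python
import numpy as np

# Toy sketch of MovieDreamer's hierarchical generation loop.
# All shapes and function bodies are illustrative assumptions:
#   1) keyframes are represented as compact visual tokens,
#   2) an autoregressive model predicts the next keyframe's tokens
#      from the multimodal script embedding and the token history,
#   3) a diffusion decoder renders tokens back into a keyframe.

rng = np.random.default_rng(0)

TOKENS_PER_KEYFRAME = 16   # assumed compact token count per keyframe
TOKEN_DIM = 8              # assumed token embedding size

def predict_next_tokens(script_emb: np.ndarray,
                        history: list) -> np.ndarray:
    """Stand-in for the multimodal autoregressive model."""
    # Condition on the script and the mean of previously generated tokens.
    context = script_emb + (np.mean(history, axis=0) if history else 0.0)
    return np.tanh(context)  # toy nonlinearity in place of a transformer

def decode_tokens(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the identity-preserving diffusion decoder."""
    return np.repeat(tokens, 4, axis=0)  # toy upsampling to "pixels"

# Generate three keyframes autoregressively from one script embedding.
script_emb = rng.normal(size=(TOKENS_PER_KEYFRAME, TOKEN_DIM))
history = []
keyframes = []
for _ in range(3):
    tokens = predict_next_tokens(script_emb, history)
    history.append(tokens)
    keyframes.append(decode_tokens(tokens))

print(len(keyframes), keyframes[0].shape)  # 3 (64, 8)
```

The key structural point the sketch captures is that narrative state lives in the compact token history (the autoregressive level), while pixel-level quality is delegated entirely to the decoder (the diffusion level).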
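Among the anti-overfitting strategies listed above, token masking is the most concrete to illustrate: during training, a random fraction of the target visual tokens is hidden so the model cannot simply memorize full keyframe sequences. The helper below is a hedged sketch; the `mask_ratio` and the zero replacement value are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.3, mask_value=0.0, rng=None):
    """Randomly replace a fraction of token rows with mask_value.

    Returns the masked copy and the boolean mask (True = masked row).
    Sketch of a token-masking regularizer; not the paper's exact scheme.
    """
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(tokens.shape[0]) < mask_ratio
    masked = tokens.copy()
    masked[mask] = mask_value
    return masked, mask

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 8))  # one keyframe's visual tokens
masked, mask = mask_tokens(tokens, mask_ratio=0.25, rng=rng)
print(masked.shape)
```

A usage note: in a training loop such a mask would typically also weight the loss so the model is supervised on reconstructing the masked positions, which is what forces it to rely on the script and history context rather than memorization.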