**MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences**
Recent advances in video generation have focused largely on short-duration content produced with diffusion models. These approaches struggle with complex narratives and with keeping characters consistent over extended periods, both of which are essential for long-form productions such as movies. To address this, the authors propose MovieDreamer, a hierarchical framework that integrates autoregressive models with diffusion-based rendering to generate long-duration videos with intricate plot progression and high visual fidelity.
**Key Contributions:**
1. **Hierarchical Framework:** MovieDreamer combines autoregressive models for global narrative coherence and diffusion rendering for high-quality visual fidelity.
2. **Multimodal Script:** A structured multimodal script enriches scene descriptions with detailed character information and visual style, enhancing continuity and character identity across scenes.
3. **Identity-Preserving Rendering:** An identity-preserving diffusion decoder mitigates errors in vision token prediction and improves identity preservation in the generated videos.
**Methodology:**
- **Diffusion Autoencoder:** Tokenizes keyframes into compact visual tokens using a diffusion autoencoder.
- **Autoregressive Keyframe Token Generation:** Uses a multimodal autoregressive model to predict visual tokens based on the multimodal script and historical data.
- **Anti-Overfitting Strategies:** Applies data augmentation, face-embedding randomization, aggressive dropout, and token masking to combat overfitting.
- **Multimodal Scripts:** Structures the script to include rich descriptions of scenes and character identities, enhancing narrative continuity and character control.
- **ID-Preserving Rendering:** Enhances the decoder to better preserve character identities, particularly in facial features.
- **Keyframe-Based Video Generation:** Anchors each clip on the last frame of the previously generated video, keeping the model grounded in the original image distribution.
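As a rough illustration of one of the anti-overfitting strategies above, token masking during training can be sketched as follows. This is not the paper's implementation; the mask-token id and the toy sequence are assumptions for the example:

```python
import numpy as np

MASK_ID = 0  # hypothetical reserved id for the mask token

def mask_tokens(tokens, mask_prob, rng):
    """Randomly replace visual-token ids with MASK_ID during training.

    Masking part of the keyframe-token sequence forces the autoregressive
    model to rely on the multimodal script and history rather than
    memorizing exact token patterns, which combats overfitting.
    """
    tokens = tokens.copy()
    drop = rng.random(tokens.shape) < mask_prob  # Bernoulli mask per token
    tokens[drop] = MASK_ID
    return tokens

rng = np.random.default_rng(0)
seq = np.arange(1, 11)              # toy keyframe-token sequence (ids 1..10)
masked = mask_tokens(seq, 0.3, rng) # ~30% of tokens replaced with MASK_ID
```

The same idea generalizes to the other regularizers listed above: face-embedding randomization perturbs the identity inputs, and dropout zeroes activations, all to keep the model from memorizing the training movies.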
**Experiments:**
- **Dataset and Evaluation Metrics:** Uses a test dataset of 10 long movies and evaluates using CLIP score, Aesthetic Score (AS), Fréchet Image Distance (FID), and Inception Score (IS).
- **Comparison with State-of-the-Art:** Demonstrates superior performance in generating long videos with high visual and narrative quality, maintaining both short-term and long-term consistency.
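For intuition on the FID metric used above: it is the Fréchet distance between Gaussian fits of two feature sets. The sketch below simplifies to diagonal covariances (the full metric requires a matrix square root of the covariance product); the feature arrays are placeholders, not the paper's features:

```python
import numpy as np

def fid_diagonal(feats_a, feats_b):
    """Fréchet distance between two feature sets, assuming diagonal
    covariances for simplicity: ||mu_a - mu_b||^2 + sum of
    (var_a + var_b - 2*sqrt(var_a * var_b)) over feature dimensions.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    var_a, var_b = feats_a.var(axis=0), feats_b.var(axis=0)
    mean_term = np.sum((mu_a - mu_b) ** 2)
    cov_term = np.sum(var_a + var_b - 2.0 * np.sqrt(var_a * var_b))
    return float(mean_term + cov_term)
```

Lower is better: identical feature distributions give a distance of zero, so a low FID between generated keyframes and real movie frames indicates high visual fidelity.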
**Conclusion:**
MovieDreamer addresses the challenge of generating long-duration visual content with complex narratives by combining the strengths of autoregression and diffusion. It introduces a multimodal script to preserve character consistency and an ID-preserving rendering technique to enhance character identity preservation. This work opens up exciting possibilities for automated long-duration video production.