SEED-Story: Multimodal Long Story Generation with Large Language Model


2024-07-11 | Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate multimodal long stories that interleave rich narrative text with vivid, consistent images. The model can produce sequences of up to 25 multimodal segments, even though only 10 segments are used during training. Key contributions include:

1. **Model Architecture**: SEED-Story uses a pre-trained Vision Transformer (ViT) for visual tokenization and a diffusion model for de-tokenization to generate images. The MLLM predicts both text and visual tokens; the visual tokens are then processed by an adapted visual de-tokenizer to produce consistent images.
2. **Multimodal Attention Sink Mechanism**: This mechanism enables efficient generation of long stories by retaining the leading text tokens, the leading image tokens, and the end-of-image tokens in the attention window, preserving coherence as the generated sequence grows (an illustrative sketch follows this list).
3. **StoryStream Dataset**: A large-scale, high-resolution dataset introduced for training and evaluating multimodal story generation, featuring rich narrative texts and engaging, high-resolution images drawn from animated videos.
4. **Evaluation**: SEED-Story is assessed on visual quality, style consistency, story engagement, and image-text coherence. Quantitative and qualitative evaluations show superior performance compared to existing methods.
5. **Applications**: The model has broad applications in education and entertainment, enhancing storytelling experiences by dynamically integrating text and visuals.

The paper also discusses related work, method details, and limitations, emphasizing experimentation on real-world data and greater dataset diversity as ways to improve the model's generalization.
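The attention-sink idea can be read as a KV-cache retention policy: a small set of "sink" positions (here, the leading text tokens plus the image-boundary tokens) is never evicted, while everything else is kept only within a recent sliding window. The sketch below is a minimal illustration of that policy, not the paper's actual implementation; it assumes PyTorch, and the function name `evict_kv_cache` and its arguments are hypothetical.

```python
import torch

def evict_kv_cache(keys, values, sink_positions, window_size):
    """Retain designated 'sink' positions plus the most recent `window_size`
    positions of a per-layer KV cache; drop everything in between.

    keys, values: tensors of shape (batch, heads, seq_len, head_dim)
    sink_positions: indices that must never be evicted (e.g. leading text
        tokens and begin/end-of-image marker tokens, per the multimodal
        attention-sink description)
    """
    seq_len = keys.shape[2]
    recent = range(max(0, seq_len - window_size), seq_len)
    keep = sorted({p for p in sink_positions if p < seq_len} | set(recent))
    idx = torch.tensor(keep, device=keys.device)
    return keys.index_select(2, idx), values.index_select(2, idx), keep


# Toy usage: a 40-position cache, sinks at the sequence start and at two
# hypothetical image-boundary positions, and a recent window of 16.
if __name__ == "__main__":
    B, H, T, D = 1, 8, 40, 64
    k, v = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
    k2, v2, kept = evict_kv_cache(k, v, sink_positions=[0, 1, 2, 10, 21],
                                  window_size=16)
    print(len(kept), k2.shape)  # far fewer than 40 positions retained
```

The point of the sketch is only the retention rule: because the sink positions stay in the cache indefinitely, attention can keep anchoring on them even as the story extends well beyond the training length.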