SEED-Story: Multimodal Long Story Generation with Large Language Model


11 Jul 2024 | Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories, producing rich narrative text and images that remain consistent in characters and style. Using a multimodal attention sink mechanism for efficient long-sequence generation, the model can produce stories of up to 25 multimodal sequences even though only 10 sequences are used during training.

Alongside the model, the authors introduce StoryStream, a large-scale, high-resolution dataset for training and evaluating multimodal story generation. StoryStream is four times larger than existing story datasets and offers higher image resolution, longer sequence lengths, and more detailed story narratives. The model, code, and dataset are available at https://github.com/TencentARC/SEED-Story.
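The attention-sink idea behind extrapolating past the training length can be illustrated with a small sketch: the key/value cache always retains the earliest positions (the "sinks") plus a sliding window of recent positions, so attention remains stable as the story grows. The code below is a simplified, generic version of that cache-truncation step, not the SEED-Story implementation; the function name, default values, and tensor layout are assumptions for illustration.

```python
import torch

def truncate_kv_cache(past_key_values, n_sink=4, window=1024):
    """Keep the first `n_sink` cached positions (the attention 'sinks')
    plus the most recent `window` positions for every layer.

    `past_key_values` is assumed to be a tuple of (key, value) pairs with
    shape [batch, heads, seq_len, head_dim], as returned by many
    transformer implementations.
    """
    trimmed = []
    for key, value in past_key_values:
        seq_len = key.shape[2]
        if seq_len <= n_sink + window:
            # Cache still fits; nothing to evict.
            trimmed.append((key, value))
            continue
        key = torch.cat([key[:, :, :n_sink], key[:, :, -window:]], dim=2)
        value = torch.cat([value[:, :, :n_sink], value[:, :, -window:]], dim=2)
        trimmed.append((key, value))
    return tuple(trimmed)
```

In a multimodal setting, the retained "sink" set would also cover the token spans of anchor images and text, which is what allows generation to run well beyond the 10-sequence training length.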
StoryStream is constructed from cartoon series, which inherently contain rich plots and consistent character portrayals: keyframes and their associated subtitles are extracted, and detailed image descriptions are then generated for each frame. SEED-Story is trained on this dataset and fine-tuned using LoRA, as sketched below.

The method is evaluated on several aspects of multimodal story generation, including image style consistency, story engagement, and image-text coherence, and achieves superior performance on all of them. Compared with existing methods such as MM-Interleaved, SEED-Story demonstrates better image quality, style consistency, and story engagement, and it is capable of generating long stories with engaging plots and vivid images.
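As a rough illustration of LoRA fine-tuning, the snippet below uses the Hugging Face PEFT library to attach low-rank adapters to a language model's attention projections. The base checkpoint, target modules, and hyperparameters here are illustrative assumptions, not the actual SEED-Story training configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base checkpoint; the MLLM backbone actually used by
# SEED-Story is defined in its own training recipe.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                         # rank of the low-rank update (illustrative)
    lora_alpha=32,                # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Only the adapter matrices are updated during fine-tuning, which keeps the memory and compute cost low relative to full-parameter training of the MLLM.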