Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

3 Jul 2024 | Fei Shen, Hu Ye, Sibo Liu, Jun Zhang*, Cong Wang, Xiao Han, and Wei Yang
The paper "Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models" by Fei Shen et al. addresses the challenge of generating consistent and coherent stories from multiple captions and reference clips. Current methods, which often rely on autoregressive models and caption-dependent generation, struggle to maintain contextual and temporal consistency. To tackle this, the authors propose Rich-contextual Conditional Diffusion Models (RCDMs), a two-stage approach that enhances semantic and temporal consistency in story generation. In the first stage, a frame-prior transformer diffusion model is introduced to predict the semantic embeddings of frames in an unknown clip by aligning the semantic correlations between captions and frames of known clips. This model focuses on the semantic feature level, predicting frame semantic embeddings using a combination of transformer blocks and frame attention blocks. In the second stage, a frame-contextual 3D diffusion model is established to generate consistent stories by jointly infusing rich contextual conditions, including reference images, predicted frame semantic embeddings, and text embeddings of all captions, at both the image and feature levels. This model uses a multimodal interaction module and a semantic stacking module to enhance the alignment between text and image modalities. The authors conduct comprehensive experiments on two datasets, FlintstonesSV and PororoSV, and evaluate the model using objective metrics (Char-Acc, Char-F1, FID) and subjective assessments (user studies). The results demonstrate that RCDMs outperform existing methods in generating consistent and high-quality stories, with superior performance in character consistency, visual quality, and temporal coherence. Additionally, RCDMs achieve faster inference speeds compared to other state-of-the-art models, generating all story images in a single forward pass. The paper concludes by highlighting the limitations of current methods, such as their closed-set nature, and suggests future work in exploring open-set generation capabilities to allow for a broader range of characters and scenes.The paper "Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models" by Fei Shen et al. addresses the challenge of generating consistent and coherent stories from multiple captions and reference clips. Current methods, which often rely on autoregressive models and caption-dependent generation, struggle to maintain contextual and temporal consistency. To tackle this, the authors propose Rich-contextual Conditional Diffusion Models (RCDMs), a two-stage approach that enhances semantic and temporal consistency in story generation. In the first stage, a frame-prior transformer diffusion model is introduced to predict the semantic embeddings of frames in an unknown clip by aligning the semantic correlations between captions and frames of known clips. This model focuses on the semantic feature level, predicting frame semantic embeddings using a combination of transformer blocks and frame attention blocks. In the second stage, a frame-contextual 3D diffusion model is established to generate consistent stories by jointly infusing rich contextual conditions, including reference images, predicted frame semantic embeddings, and text embeddings of all captions, at both the image and feature levels. This model uses a multimodal interaction module and a semantic stacking module to enhance the alignment between text and image modalities. 
The authors conduct comprehensive experiments on two datasets, FlintstonesSV and PororoSV, and evaluate the model using objective metrics (Char-Acc, Char-F1, FID) and subjective assessments (user studies). The results demonstrate that RCDMs outperform existing methods in generating consistent and high-quality stories, with superior performance in character consistency, visual quality, and temporal coherence. Additionally, RCDMs achieve faster inference speeds compared to other state-of-the-art models, generating all story images in a single forward pass. The paper concludes by highlighting the limitations of current methods, such as their closed-set nature, and suggests future work in exploring open-set generation capabilities to allow for a broader range of characters and scenes.
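For concreteness, here is a small hedged sketch of how character-consistency scores of the kind cited above (Char-Acc, Char-F1) are typically computed: a character classifier (assumed here, not implemented) labels each generated frame, and its multi-label predictions are compared against the ground-truth character annotations. The exact classifier and averaging scheme used in the paper may differ.

```python
# Hedged illustration of character-consistency metrics (Char-Acc / Char-F1).
# The character classifier is assumed; the per-frame predictions below are stand-ins.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# One row per generated frame, one column per character (1 = character present).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])   # ground-truth characters
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])   # classifier output on generated frames

char_acc = accuracy_score(y_true, y_pred)               # exact-match accuracy per frame
char_f1 = f1_score(y_true, y_pred, average="micro")     # micro-averaged F1 over character labels
print(f"Char-Acc={char_acc:.2f}, Char-F1={char_f1:.2f}")
```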