3 Jul 2024 | Fei Shen, Hu Ye, Sibo Liu, Jun Zhang*, Cong Wang, Xiao Han, and Wei Yang
This paper introduces Rich-Contextual Conditional Diffusion Models (RCDMs) to enhance the consistency of story visualization by incorporating rich contextual conditions at both the image and feature levels. The proposed method consists of two stages: a frame-prior transformer diffusion model and a frame-contextual 3D diffusion model. In the first stage, the frame-prior transformer diffusion model predicts the semantic embeddings of unknown clips by aligning the semantic correlations between known clips and their captions. In the second stage, the model integrates rich contextual conditions, including reference images, predicted frame embeddings, and text embeddings, to generate consistent stories. Unlike autoregressive models, RCDMs generate the full consistent story in a single forward inference. Qualitative and quantitative results show that RCDMs outperform existing methods in challenging scenarios. The method is evaluated on two datasets, and user studies confirm its effectiveness in generating visually consistent and temporally coherent stories. The code and model are available at https://github.com/muzishen/RCDMs.
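The abstract describes the two-stage architecture only at a high level, so the following is a minimal structural sketch of how such a pipeline could be wired together in PyTorch. Every class name, argument, and shape here (FramePriorTransformer, FrameContextual3DDenoiser, the toy dimensions) is a hypothetical illustration for clarity, not the authors' released implementation; see https://github.com/muzishen/RCDMs for the actual code.

```python
# Minimal structural sketch (assumption-laden) of the two-stage RCDM pipeline
# described above. All class names, arguments, and shapes are hypothetical
# illustrations, not the authors' released code (see the RCDMs repository).
import torch
import torch.nn as nn


class FramePriorTransformer(nn.Module):
    """Stage 1 sketch: predict semantic embeddings of unknown clips from
    known-clip embeddings and caption embeddings with a transformer denoiser."""

    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, depth)
        self.time_embed = nn.Linear(1, dim)

    def forward(self, noisy_frame_emb, known_clip_emb, caption_emb, t):
        # Concatenate the conditions with the noisy target embeddings, denoise,
        # and return only the positions that correspond to the unknown frames.
        cond = torch.cat([known_clip_emb, caption_emb], dim=1)
        x = torch.cat([cond, noisy_frame_emb + self.time_embed(t)], dim=1)
        return self.denoiser(x)[:, cond.shape[1]:]


class FrameContextual3DDenoiser(nn.Module):
    """Stage 2 sketch: a 3D denoiser that takes reference frames at the image
    level and frame/text embeddings at the feature level, and predicts noise
    for every story frame in one forward pass (non-autoregressive)."""

    def __init__(self, channels=4, dim=768):
        super().__init__()
        self.cond_proj = nn.Linear(dim * 2, channels * 2)
        self.backbone = nn.Conv3d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, noisy_latents, ref_latents, frame_emb, text_emb):
        # Image-level condition: channel-concatenate reference-frame latents.
        x = torch.cat([noisy_latents, ref_latents], dim=1)  # (B, 2C, T, H, W)
        # Feature-level condition: fuse predicted frame and text embeddings.
        cond = self.cond_proj(torch.cat([frame_emb, text_emb], dim=-1)).mean(dim=1)
        x = x + cond[:, :, None, None, None]
        return self.backbone(x)


# Toy shapes only, to show that all frames are denoised in a single pass.
B, T, C, H, W, D = 1, 4, 4, 32, 32, 768
stage1 = FramePriorTransformer(dim=D)
stage2 = FrameContextual3DDenoiser(channels=C, dim=D)
frame_emb = stage1(torch.randn(B, T, D), torch.randn(B, 2, D),
                   torch.randn(B, T, D), torch.rand(B, 1, 1))
noise = stage2(torch.randn(B, C, T, H, W), torch.randn(B, C, T, H, W),
               frame_emb, torch.randn(B, T, D))
print(frame_emb.shape, noise.shape)  # (1, 4, 768) and (1, 4, 4, 32, 32)
```

The sketch is only meant to make the data flow concrete: stage 1 operates purely in embedding space, and stage 2 conditions on the reference frames at the image level and on the predicted frame and text embeddings at the feature level, denoising all frames jointly rather than autoregressively.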