**StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation**
This paper addresses the challenge of maintaining consistent content across a series of generated images, particularly when those images involve complex details and subjects. The authors propose a method called Consistent Self-Attention, which substantially improves consistency between generated images and can be plugged into pre-trained diffusion-based text-to-image models without additional training. To extend the approach to long-range video generation, they introduce the Semantic Motion Predictor, a module trained to estimate motion conditions between two images in a semantic space, yielding smooth transitions and consistent subjects in the generated videos.
The key contributions of StoryDiffusion include:
1. **Consistent Self-Attention**: A training-free and pluggable attention module that maintains character consistency across a sequence of generated images while enhancing text controllability (see the sketch after this list).
2. **Semantic Motion Predictor**: A module that predicts transitions between images in a semantic space, generating more stable long-range video frames than methods that rely solely on temporal modules (a sketch appears at the end of this summary).
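One natural way to realize a training-free, pluggable consistency module of the kind described in item 1 is to let each image in a batch attend not only to its own tokens but also to tokens sampled from the other images generated alongside it, reusing the frozen projection weights of the pre-trained self-attention layer. The single-head sketch below illustrates that idea; the function name, the `sample_ratio` knob, and the random-sampling strategy are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(x, w_q, w_k, w_v, sample_ratio=0.5):
    """Single-head sketch of a consistency-oriented self-attention step.

    x: (B, N, C) hidden states of B images generated together (assumes B >= 2).
    w_q, w_k, w_v: (C, C) frozen projection weights reused from the pre-trained layer.
    sample_ratio: fraction of tokens borrowed from the other images (assumed knob).
    """
    B, N, C = x.shape
    q = x @ w_q  # queries stay per-image: (B, N, C)

    outputs = []
    for i in range(B):
        # Pool tokens from every *other* image in the batch and sample a subset.
        others = torch.cat([x[j] for j in range(B) if j != i], dim=0)  # ((B-1)*N, C)
        keep = max(1, int(sample_ratio * others.shape[0]))
        shared = others[torch.randperm(others.shape[0])[:keep]]        # (S, C)

        # Keys/values see the image's own tokens plus the shared tokens,
        # so attention can reuse the same subject features across images.
        kv = torch.cat([x[i], shared], dim=0)                          # (N + S, C)
        k, v = kv @ w_k, kv @ w_v

        attn = F.softmax((q[i] @ k.T) / C ** 0.5, dim=-1)              # (N, N + S)
        outputs.append(attn @ v)                                       # (N, C)

    return torch.stack(outputs)                                        # (B, N, C)
```

In a real pipeline such a function would presumably stand in for the self-attention call inside the denoising network at generation time, with the projection weights taken from the frozen text-to-image model, which is what keeps the module training-free.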
The authors demonstrate the effectiveness of their approach through experiments, showing that StoryDiffusion can generate subject-consistent images and videos that are more stable and controllable than existing methods. The paper also includes a user study confirming the superior performance of StoryDiffusion in both consistent image generation and transition video generation.
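The Semantic Motion Predictor is described as estimating motion conditions between two images in a semantic space. A minimal sketch consistent with that description could encode the start and end images into semantic embeddings and let a small transformer predict the in-between frame embeddings, which would then condition a video generator. The class name, the learnable per-frame queries, and all hyperparameters below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SemanticMotionPredictorSketch(nn.Module):
    """Illustrative module: predicts intermediate frame embeddings between the
    semantic embeddings of a start and an end image. Names and hyperparameters
    are assumptions, not the paper's released architecture."""

    def __init__(self, embed_dim=768, num_frames=16, num_layers=4, num_heads=8):
        super().__init__()
        # One learnable query per intermediate frame to be predicted.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, start_emb, end_emb):
        # start_emb, end_emb: (B, D) semantic embeddings of the two key images
        # (e.g. from a frozen image encoder).
        B = start_emb.shape[0]
        queries = self.frame_queries.unsqueeze(0).expand(B, -1, -1)  # (B, T, D)
        context = torch.stack([start_emb, end_emb], dim=1)           # (B, 2, D)
        tokens = torch.cat([context, queries], dim=1)                # (B, 2 + T, D)
        out = self.transformer(tokens)
        # The predicted in-between embeddings would condition a video generator.
        return out[:, 2:, :]                                         # (B, T, D)

# Example: predict 16 intermediate embeddings between two 768-d image embeddings.
predictor = SemanticMotionPredictorSketch()
frames = predictor(torch.randn(2, 768), torch.randn(2, 768))  # (2, 16, 768)
```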