**StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation**
This paper addresses the challenge of maintaining consistent content across a series of generated images, particularly when those images involve complex details and subjects. The authors propose a method called Consistent Self-Attention, which substantially improves consistency between generated images and can be plugged into pre-trained diffusion-based text-to-image models without additional training. To extend the approach to long-range video generation, they introduce the Semantic Motion Predictor, a module trained to estimate motion conditions between two images in a semantic space, yielding smooth transitions and consistent subjects in the generated videos.
The key contributions of StoryDiffusion include:
1. **Consistent Self-Attention**: A training-free and pluggable attention module that maintains character consistency across a sequence of generated images while enhancing text controllability (see the sketch after this list).
2. **Semantic Motion Predictor**: A module that predicts transitions between images in a semantic space, generating more stable long-range video frames than methods that rely solely on temporal modules (a sketch appears at the end of this summary).
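One natural way to realize a training-free, pluggable consistency module of the kind described in item 1 is to let each image in a batch attend not only to its own tokens but also to tokens sampled from the other images generated alongside it, reusing the frozen projection weights of the pre-trained self-attention layer. The single-head sketch below illustrates that idea; the function name, the `sample_ratio` knob, and the random-sampling strategy are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(x, w_q, w_k, w_v, sample_ratio=0.5):
    """Single-head sketch of a consistency-oriented self-attention step.

    x: (B, N, C) hidden states of B images generated together (assumes B >= 2).
    w_q, w_k, w_v: (C, C) frozen projection weights reused from the pre-trained layer.
    sample_ratio: fraction of tokens borrowed from the other images (assumed knob).
    """
    B, N, C = x.shape
    q = x @ w_q  # queries stay per-image: (B, N, C)

    outputs = []
    for i in range(B):
        # Pool tokens from every *other* image in the batch and sample a subset.
        others = torch.cat([x[j] for j in range(B) if j != i], dim=0)  # ((B-1)*N, C)
        keep = max(1, int(sample_ratio * others.shape[0]))
        shared = others[torch.randperm(others.shape[0])[:keep]]        # (S, C)

        # Keys/values see the image's own tokens plus the shared tokens,
        # so attention can reuse the same subject features across images.
        kv = torch.cat([x[i], shared], dim=0)                          # (N + S, C)
        k, v = kv @ w_k, kv @ w_v

        attn = F.softmax((q[i] @ k.T) / C ** 0.5, dim=-1)              # (N, N + S)
        outputs.append(attn @ v)                                       # (N, C)

    return torch.stack(outputs)                                        # (B, N, C)
```

In a real pipeline such a function would presumably stand in for the self-attention call inside the denoising network at generation time, with the projection weights taken from the frozen text-to-image model, which is what keeps the module training-free.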
The authors demonstrate the effectiveness of their approach through experiments, showing that StoryDiffusion can generate subject-consistent images and videos that are more stable and controllable than existing methods. The paper also includes a user study confirming the superior performance of StoryDiffusion in both consistent image generation and transition video generation.
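The Semantic Motion Predictor is described as estimating motion conditions between two images in a semantic space. A minimal sketch consistent with that description could encode the start and end images into semantic embeddings and let a small transformer predict the in-between frame embeddings, which would then condition a video generator. The class name, the learnable per-frame queries, and all hyperparameters below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SemanticMotionPredictorSketch(nn.Module):
    """Illustrative module: predicts intermediate frame embeddings between the
    semantic embeddings of a start and an end image. Names and hyperparameters
    are assumptions, not the paper's released architecture."""

    def __init__(self, embed_dim=768, num_frames=16, num_layers=4, num_heads=8):
        super().__init__()
        # One learnable query per intermediate frame to be predicted.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, start_emb, end_emb):
        # start_emb, end_emb: (B, D) semantic embeddings of the two key images
        # (e.g. from a frozen image encoder).
        B = start_emb.shape[0]
        queries = self.frame_queries.unsqueeze(0).expand(B, -1, -1)  # (B, T, D)
        context = torch.stack([start_emb, end_emb], dim=1)           # (B, 2, D)
        tokens = torch.cat([context, queries], dim=1)                # (B, 2 + T, D)
        out = self.transformer(tokens)
        # The predicted in-between embeddings would condition a video generator.
        return out[:, 2:, :]                                         # (B, T, D)

# Example: predict 16 intermediate embeddings between two 768-d image embeddings.
predictor = SemanticMotionPredictorSketch()
frames = predictor(torch.randn(2, 768), torch.randn(2, 768))  # (2, 16, 768)
```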