StoryDiffusion is a method for generating consistent images and videos for storytelling. It introduces Consistent Self-Attention, a training-free and pluggable attention mechanism that enhances consistency between generated images; the mechanism is inserted into the diffusion backbone in place of the original self-attention in a zero-shot manner. In addition, a semantic-space temporal motion prediction module, the Semantic Motion Predictor, is introduced to generate smooth and stable video transitions. It predicts the transition between two images in a semantic space, which yields more stable video frames than methods that operate in latent space. By combining these two components, StoryDiffusion can generate consistent images and videos that effectively narrate a story. Evaluations on both image and video generation tasks show superior performance in maintaining subject consistency, preserving text controllability, and producing smooth transitions, and user studies confirm that the method generates consistent and visually appealing images and videos.
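
To make the idea of a training-free, pluggable replacement for self-attention concrete, below is a minimal sketch of how such a cross-image attention layer could look. It assumes the mechanism works by sampling a subset of tokens from every image in the story batch and appending them to the key/value inputs, so that each image attends to the others while the frozen projection weights are reused unchanged; the function name, the `sample_ratio` parameter, and the single-head formulation are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def consistent_self_attention_sketch(hidden_states, to_q, to_k, to_v, sample_ratio=0.5):
    """Hypothetical sketch of a consistency-promoting self-attention forward pass.

    hidden_states: (batch, tokens, dim), one row per image generated for the story.
    to_q / to_k / to_v: the existing linear projections of the pretrained
    self-attention layer, reused without any fine-tuning (training-free).
    Multi-head splitting is omitted for brevity.
    """
    b, n, d = hidden_states.shape

    # Assumption: sample a random subset of tokens from each image and share the
    # pooled set across the whole batch, coupling the images together.
    num_sampled = int(n * sample_ratio)
    idx = torch.randperm(n, device=hidden_states.device)[:num_sampled]
    shared = hidden_states[:, idx, :].reshape(1, b * num_sampled, d).expand(b, -1, -1)

    # Queries come from the image itself; keys and values are augmented with the
    # shared tokens, so attention can pull in content from sibling images.
    q = to_q(hidden_states)
    kv_input = torch.cat([hidden_states, shared], dim=1)
    k = to_k(kv_input)
    v = to_v(kv_input)

    # Standard scaled dot-product attention over the augmented key/value set.
    return F.scaled_dot_product_attention(q, k, v)  # (batch, tokens, dim)
```

Because the layer only changes what the keys and values attend over, it can be swapped in for the original self-attention at inference time without retraining, which is the sense in which the paper describes the mechanism as zero-shot and pluggable.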