Training-Free Consistent Text-to-Image Generation

ConsiStory is a training-free method for generating images of a consistent subject across varied prompts. Rather than relying on training or optimization, it leverages the internal activations of a pre-trained text-to-image diffusion model. It introduces a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images, together with strategies that encourage layout diversity while preserving that consistency. The method extends naturally to multi-subject scenarios, enables training-free personalization for common object classes, and is compatible with existing editing tools such as ControlNet. It can also generate consistent subjects across different styles and handle occluded objects.

In qualitative and quantitative comparisons against a range of baselines, ConsiStory demonstrates state-of-the-art performance on subject consistency and text alignment without a single optimization step, and it generates images faster than existing approaches. Its limitations include reliance on cross-attention maps for object localization and potential biases in the underlying SDXL model.
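To make the shared-attention idea concrete, here is a minimal NumPy sketch and not the authors' implementation. It assumes each image in a batch has per-patch queries, keys, and values, plus a precomputed boolean subject mask; the function name, the tensor shapes, and the way the mask is supplied are all illustrative assumptions. Each image attends to its own patches and, additionally, to the subject patches of every other image in the batch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subject_shared_attention(q, k, v, subject_mask):
    """Hypothetical sketch of subject-driven shared attention.

    q, k, v:       (B, N, D) per-image queries, keys, values
    subject_mask:  (B, N) boolean, True where a patch belongs to the subject
                   (assumed given; how it is obtained is outside this sketch)
    returns:       (B, N, D) attention outputs
    """
    B, N, D = q.shape
    # Pool keys and values from every image in the batch.
    k_all = k.reshape(B * N, D)
    v_all = v.reshape(B * N, D)
    out = np.empty_like(q)
    for i in range(B):
        # Image i may see all of its own patches, but only the
        # subject patches of the other images.
        allowed = subject_mask.copy()
        allowed[i] = True
        scores = q[i] @ k_all.T / np.sqrt(D)            # (N, B*N)
        scores = np.where(allowed.reshape(1, B * N), scores, -np.inf)
        out[i] = softmax(scores) @ v_all
    return out
```

With an all-False mask this reduces to ordinary per-image self-attention, which is a convenient sanity check. Restricting the cross-image keys to subject patches, rather than sharing all patches, is what lets backgrounds and layouts stay independent in this sketch.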

Training-Free Consistent Text-to-Image Generation

July 2024 | YOAD TEWEL, OMRI KADURI, RINON GAL, YONI KASTEN, LIOR WOLF, GAL CHECHIK, YUVAL ATZMON