JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation


8 Jul 2024 | Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, Yogesh Balaji
**Abstract:** JeDi is a finetuning-free personalization model for text-to-image generation that can operate on any number of reference images. It preserves the appearance of custom subjects while generating novel variations, and unlike prior models it does not suffer from overfitting or a lack of diversity. The key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. JeDi combines a scalable synthetic dataset generation technique with a modified diffusion model architecture that encodes relationships between multiple images. At test time, JeDi generates personalized images by taking reference images as input during the sampling process. Experimental results show that JeDi achieves state-of-the-art performance in personalized text-to-image generation, even with a single reference image.

**Introduction:** Personalized text-to-image generation models enable users to create images depicting their individual possessions in diverse scenes. Existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which is resource-intensive and time-consuming. JeDi avoids finetuning by learning a joint distribution of multiple related text-image pairs. The model is trained on a synthetic dataset of related images, with an architecture modified to encode relationships between images. At inference, JeDi generates personalized images from multiple text prompts and reference images, achieving high-fidelity results even for challenging subjects.

**Method:** JeDi constructs a Synthetic Same-Subject (S³) dataset using large language models and single-image diffusion models; the dataset contains sets of images sharing a common subject. The model is trained to jointly denoise multiple same-subject images using coupled self-attention layers, which fuse self-attention features across the images in a set. During inference, personalized generation is cast as an inpainting task: the missing images in a set are generated conditioned on the given reference images, and image guidance further improves alignment between the generated images and the input references. Minimal sketches of these two components follow below.
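The coupled self-attention described above can be pictured as ordinary self-attention run over the concatenated tokens of all images in a same-subject set, so that each image can borrow appearance features from its siblings during joint denoising. The following is a minimal PyTorch sketch of that idea; the class name, argument names, and tensor layout are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoupledSelfAttention(nn.Module):
    """Self-attention over the concatenated tokens of all images in a
    same-subject set (illustrative sketch, not the authors' implementation)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, set_size: int) -> torch.Tensor:
        # x: (batch * set_size, tokens, dim) -- per-image U-Net features,
        # with the images of each set stored contiguously along the batch axis.
        bs, t, d = x.shape
        b = bs // set_size
        h = self.num_heads

        # Concatenate the token sequences of the whole set along the length
        # dimension, then attend as usual: every token now sees every image.
        x = x.reshape(b, set_size * t, d)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (z.reshape(b, set_size * t, h, d // h).transpose(1, 2) for z in (q, k, v))

        out = F.scaled_dot_product_attention(q, k, v)          # (b, h, set_size*t, d/h)
        out = out.transpose(1, 2).reshape(b, set_size * t, d)  # merge heads
        return self.to_out(out).reshape(bs, t, d)              # restore per-image layout
```

Note that in this formulation the parameter shapes match a standard self-attention layer; under that assumption, pretrained single-image attention weights could be reused and only the attention span changes.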
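The inpainting-style inference with image guidance can likewise be sketched as a sampling loop in which the reference slots of the joint set are repeatedly re-noised to the current timestep while only the target slot is updated, and a guided noise prediction contrasts a pass that sees the references against one that does not. Everything below is a conceptual sketch under those assumptions: `joint_unet`, the `scheduler` interface, and the noise-for-references unconditional branch are hypothetical stand-ins, not the authors' implementation.

```python
import torch


@torch.no_grad()
def personalized_sample(joint_unet, scheduler, refs, text, steps=50, guidance_scale=3.0):
    """refs: (n_ref, C, H, W) clean reference latents; returns one generated latent.

    Illustrative sketch: `joint_unet(x, t, text)` is assumed to denoise a stacked
    set of latents jointly, and `scheduler` is assumed to expose timesteps(),
    add_noise(), and step() with the signatures used below.
    """
    target = torch.randn(1, *refs.shape[1:], device=refs.device)

    for t in scheduler.timesteps(steps):
        # Inpainting-style step: re-noise the references to the current level
        # so the known slots always carry the true subject appearance.
        noisy_refs = scheduler.add_noise(refs, t)
        x = torch.cat([noisy_refs, target], dim=0)

        # Image guidance (one plausible formulation): contrast the prediction
        # made with the references against one made with the references
        # replaced by pure noise, then extrapolate toward the reference-aware one.
        eps_cond = joint_unet(x, t, text)[-1:]
        x_uncond = torch.cat([torch.randn_like(noisy_refs), target], dim=0)
        eps_uncond = joint_unet(x_uncond, t, text)[-1:]
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # Only the target slot is actually updated by the reverse step.
        target = scheduler.step(eps, t, target)

    return target
```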
**Experiments:** JeDi is evaluated on benchmark datasets and compared with state-of-the-art methods. The results show that JeDi outperforms both finetuning-free and finetuning-based approaches in faithfulness to the input reference images, generating high-quality, diverse content while preserving the key visual features of the subjects.

**Conclusion:** JeDi is a finetuning-free personalized text-to-image generation model that excels at preserving input reference content. It combines a joint-image diffusion model with a scalable data synthesis pipeline to achieve high-fidelity personalization results.