JeDi is a finetuning-free personalized text-to-image generation model that produces images of a given subject from one or more reference images, without any additional modules or test-time optimization. The model learns the joint distribution of multiple related text-image pairs that share a common subject. To facilitate training, a scalable synthetic data generation technique is proposed, yielding a large-scale dataset, S³, of image-text pairs depicting the same subject. Once trained, the model enables fast and simple personalization at test time: the reference images are supplied directly as inputs during the sampling process, so no expensive optimization is needed and the identity conveyed by any number of reference images is faithfully preserved. The model is implemented on top of Stable Diffusion v1.4. Experimental results show state-of-the-art performance on both quantitative and qualitative measures, significantly outperforming prior finetuning-based and finetuning-free personalization baselines, with high-quality results on challenging personalization tasks even from a single reference image.
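The paper's released code is not reproduced here; the sketch below is only a minimal conceptual illustration of how reference images can drive sampling in a joint-image diffusion model. The names (JointDenoiser, personalized_sample), the toy noise schedule, and the inpainting-style replacement of the reference slots at each reverse step are illustrative assumptions, not the authors' implementation.

```python
# Conceptual sketch: personalized sampling with a joint-image diffusion model.
# All names and the replacement-style conditioning are assumptions for illustration.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class JointDenoiser(nn.Module):
    """Stand-in for a text-conditioned joint-image noise predictor.

    A real model (e.g. a Stable Diffusion v1.4 UNet with attention coupled
    across the joint image set) would replace this toy network.
    """
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        # x_t: (N, C, H, W) -- every image in the joint set, noised to step t.
        return self.net(x_t)

@torch.no_grad()
def personalized_sample(model, ref_latents, text_emb, steps=50):
    """Generate one new image latent conditioned on clean reference latents.

    The joint set is [references..., target]. At every reverse step the
    reference slots are overwritten with forward-noised copies of their clean
    latents (an inpainting-style replacement trick assumed here), so only the
    target slot is actually synthesized.
    """
    n_ref, c, h, w = ref_latents.shape
    x = torch.randn(n_ref + 1, c, h, w)              # start the whole joint set from noise
    timesteps = torch.linspace(T - 1, 0, steps).long()
    for t in timesteps:
        a_t = alphas_cumprod[t]
        # Pin reference slots to their (appropriately noised) ground-truth latents.
        noise = torch.randn_like(ref_latents)
        x[:n_ref] = a_t.sqrt() * ref_latents + (1 - a_t).sqrt() * noise
        # Predict noise jointly and take a deterministic DDIM-style step on every slot.
        eps = model(x, t, text_emb)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_prev = alphas_cumprod[max(t - T // steps, 0)]
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x[-1]                                      # the personalized target latent

# Example usage with random stand-in latents and no text conditioning.
model = JointDenoiser()
refs = torch.randn(3, 4, 64, 64)                      # three encoded reference images
target_latent = personalized_sample(model, refs, text_emb=None)
```

Because the references enter only through the sampling process, swapping subjects requires no retraining or per-subject weights; the same trained model serves every personalization request.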
Generated images exhibit consistent high-level semantic features and low-level attributes, and remain faithful to both the input text prompts and the reference images, even for challenging or uncommon subjects. JeDi handles a wide range of subjects, producing diverse content while preserving the key visual features of the subjects in the input images, and retains fine details in difficult cases involving unique subjects. In terms of image alignment, it outperforms competing methods, as evidenced by considerably higher DINO and MDINO scores.
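For concreteness, a DINO image-alignment score is typically the cosine similarity between DINO ViT embeddings of a generated image and the reference images. The snippet below is a simplified, assumed re-implementation using the Hugging Face facebook/dino-vits16 checkpoint; the paper's exact preprocessing and its MDINO variant are not reproduced here.

```python
# Sketch of a DINO-based image-alignment score: average cosine similarity
# between DINO ViT-S/16 [CLS] embeddings of a generated image and references.
# Simplified assumption-based metric, not the paper's exact evaluation code.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("facebook/dino-vits16")
model = ViTModel.from_pretrained("facebook/dino-vits16").eval()

@torch.no_grad()
def dino_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    # Use the [CLS] token of the last hidden state as the global image feature.
    feat = model(**inputs).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(feat, dim=-1)

def dino_score(generated: Image.Image, references: list[Image.Image]) -> float:
    """Average cosine similarity between the generated image and each reference."""
    gen = dino_embedding(generated)
    sims = [float(gen @ dino_embedding(ref).T) for ref in references]
    return sum(sims) / len(sims)

# Example usage (file paths are placeholders):
# score = dino_score(Image.open("generated.png"), [Image.open("ref1.png")])
```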