BootPIG is a novel approach that enables zero-shot personalized image generation with pretrained text-to-image diffusion models. The method introduces a bootstrapped training procedure in which images generated by the text-to-image model itself are used to teach the model to render a specific subject. BootPIG consists of two models: a Reference UNet that extracts features from reference images, and a Base UNet that generates images conditioned on those features. The Reference UNet is trained so that the features it extracts allow the Base UNet to reproduce the reference object in new scenes.

The training data is entirely synthetic, produced by combining pretrained text-to-image models, LLM chat agents, and image segmentation models. As a result, BootPIG can be trained in approximately one hour on 16 A100 GPUs, significantly faster than existing methods that require several days of pretraining.

Experiments on the DreamBooth dataset show that BootPIG outperforms existing zero-shot methods while remaining comparable to test-time finetuning approaches. A user study likewise finds that users prefer BootPIG generations over those of existing methods, both in fidelity to the reference object and in alignment with the textual prompt. The architecture supports efficient training and produces high-fidelity images that match the reference object without any test-time finetuning, making the method well suited to applications such as personalized storytelling and interactive design.
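To make the two-UNet design concrete, the sketch below shows one plausible way the Base UNet could consume Reference UNet features: activations captured from a Reference UNet layer are concatenated into the keys and values of the corresponding self-attention layer in the Base UNet, so generated tokens can attend to the reference subject. This is a minimal sketch under that assumption, not the authors' released implementation; the module name `ReferenceSelfAttention` and all tensor shapes are illustrative.

```python
# Minimal sketch: reference features injected into self-attention by
# extending the keys/values. Assumes PyTorch 2.x for
# F.scaled_dot_product_attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceSelfAttention(nn.Module):
    """Self-attention over Base UNet tokens, with keys/values extended by
    features captured from the corresponding Reference UNet layer."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, n_tokens, dim) features inside the Base UNet
        # ref: (batch, m_tokens, dim) features from the Reference UNet
        b, n, d = x.shape
        kv_input = torch.cat([x, ref], dim=1)  # keys/values see both streams
        q, k, v = self.to_q(x), self.to_k(kv_input), self.to_v(kv_input)

        # reshape to (batch, heads, tokens, head_dim) for multi-head attention
        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

# Illustrative usage with arbitrary shapes:
attn = ReferenceSelfAttention(dim=320)
x = torch.randn(2, 64, 320)    # base latent tokens
ref = torch.randn(2, 64, 320)  # reference latent tokens
y = attn(x, ref)               # -> (2, 64, 320)
```

Because only the keys and values grow, the output keeps the Base UNet's token count, so a layer like this could drop in where a standard self-attention layer sits without changing the rest of the network.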
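The bootstrapped data pipeline can be summarized in schematic pseudocode as well. Everything below is hypothetical: `generate_caption`, `generate`, and `cut_out` are placeholder interfaces standing in for the LLM chat agent, the pretrained text-to-image model, and the segmentation model; they are not real library calls.

```python
# Schematic sketch of the synthetic-data pipeline described above.
# The llm, t2i_model, and segmenter objects and their methods are
# hypothetical placeholders, not actual APIs.
def build_training_triple(llm, t2i_model, segmenter, subject_class: str):
    # 1. Ask an LLM chat agent for a caption featuring the subject class.
    caption = llm.generate_caption(subject_class)   # e.g. "a dog on a beach"

    # 2. Synthesize the target image with the pretrained text-to-image model.
    target_image = t2i_model.generate(caption)

    # 3. Segment the subject out of the target image; the cut-out serves as
    #    the reference image, so no real photos of the subject are needed.
    reference_image = segmenter.cut_out(target_image, subject_class)

    # The (reference, caption, target) triple supervises the Reference UNet /
    # Base UNet pair to reproduce the reference subject in the target scene.
    return reference_image, caption, target_image
```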