BootPIG is a novel architecture that enables zero-shot subject-driven generation in pretrained text-to-image diffusion models. It addresses a limitation of existing models, which rely on text alone to describe the desired concept, by letting users provide reference images of an object that guide the appearance of the generated subject. The architecture consists of two replicas of the pretrained text-to-image UNet: one extracts visual features from the reference images, and the other performs the actual image generation, with the reference features injected to steer the output toward the desired appearance. The training procedure bootstraps personalization capabilities using data generated by pretrained text-to-image models, LLM chat agents, and image segmentation models. Unlike existing methods that require several days of pretraining, BootPIG can be trained in approximately 1 hour on 16 A100 GPUs. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while remaining comparable to test-time finetuning approaches. User studies confirm that BootPIG generations are preferred over those of existing methods in both subject fidelity and prompt fidelity. The contributions of BootPIG include a novel architecture that enables zero-shot subject-driven generation, an effective bootstrapped learning procedure, and superior performance in personalized image generation.
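
To make the two-replica design concrete, below is a minimal, illustrative PyTorch sketch, not the authors' implementation: a reference copy of a UNet block produces features from reference-image latents, and the generation copy attends over its own features concatenated with those reference features. All names here (InjectableSelfAttention, TinyUNetBlock, personalized_denoise_step) are hypothetical simplifications introduced for illustration.

```python
# Illustrative sketch of the two-UNet idea (simplified assumption, not BootPIG's code):
# a reference replica extracts features from reference-image latents, and the base
# replica injects them into its self-attention as extra keys/values.
from typing import Optional

import torch
import torch.nn as nn


class InjectableSelfAttention(nn.Module):
    """Self-attention that can optionally attend over extra reference features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, ref_feats: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Keys/values come from the generated features alone, or from the
        # concatenation [x; ref_feats] when reference features are injected.
        kv = x if ref_feats is None else torch.cat([x, ref_feats], dim=1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out


class TinyUNetBlock(nn.Module):
    """Toy stand-in for one UNet block; a real UNet stacks many such blocks."""

    def __init__(self, dim: int):
        super().__init__()
        self.self_attn = InjectableSelfAttention(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref_feats: Optional[torch.Tensor] = None) -> torch.Tensor:
        x = x + self.self_attn(x, ref_feats)  # attention, optionally with injection
        return x + self.proj(x)


def personalized_denoise_step(base_block: TinyUNetBlock,
                              ref_block: TinyUNetBlock,
                              noisy_latents: torch.Tensor,
                              ref_latents: torch.Tensor) -> torch.Tensor:
    """One toy denoising step with reference-feature injection."""
    # 1) Reference pass: the replica processes reference-image latents to
    #    produce appearance features.
    ref_feats = ref_block(ref_latents)
    # 2) Generation pass: the base replica attends over its own features plus
    #    the reference features, steering the output toward the subject.
    return base_block(noisy_latents, ref_feats=ref_feats)


if __name__ == "__main__":
    dim = 64
    base, ref = TinyUNetBlock(dim), TinyUNetBlock(dim)
    noisy = torch.randn(1, 16, dim)  # toy "generation" latents (batch, tokens, dim)
    refs = torch.randn(1, 16, dim)   # toy reference-image latents
    out = personalized_denoise_step(base, ref, noisy, refs)
    print(out.shape)  # -> torch.Size([1, 16, 64])
```

In the full model, such injection would occur at the self-attention layers throughout the UNet and at every denoising step; the sketch collapses this to a single toy block purely to show how reference features can steer generation.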