Paint by Inpaint: Learning to Add Image Objects by Removing Them First

28 Apr 2024 | Navve Wasserman, Noam Rotstein, Roy Ganz, and Ron Kimmel
The paper "Paint by Inpaint: Learning to Add Image Objects by Removing Them First" addresses the challenge of seamlessly adding objects to images based on textual instructions without requiring user-provided input masks. The authors leverage the insight that removing objects (inpainting) is simpler than adding them, as it can be achieved using segmentation mask datasets and inpainting models. They curate a large-scale image dataset containing pairs of images and their corresponding object-removed versions, training a diffusion model to inverse the inpainting process and add objects to images. Unlike other datasets, theirs features natural target images and maintains consistency between source and target images. The model is trained using a combination of a Large Vision-Language Model (VLM) for detailed object descriptions and a Large Language Model (LLM) to convert these descriptions into natural language instructions. The trained model surpasses existing ones in both qualitative and quantitative evaluations, and the dataset is released for community use. The paper also discusses related efforts in image editing, including mask-based and mask-free editing methods, and provides a detailed methodology for creating the PIPE dataset and training the model. Experiments demonstrate the model's superior performance in object addition tasks, both quantitatively and through human evaluation.The paper "Paint by Inpaint: Learning to Add Image Objects by Removing Them First" addresses the challenge of seamlessly adding objects to images based on textual instructions without requiring user-provided input masks. The authors leverage the insight that removing objects (inpainting) is simpler than adding them, as it can be achieved using segmentation mask datasets and inpainting models. They curate a large-scale image dataset containing pairs of images and their corresponding object-removed versions, training a diffusion model to inverse the inpainting process and add objects to images. Unlike other datasets, theirs features natural target images and maintains consistency between source and target images. The model is trained using a combination of a Large Vision-Language Model (VLM) for detailed object descriptions and a Large Language Model (LLM) to convert these descriptions into natural language instructions. The trained model surpasses existing ones in both qualitative and quantitative evaluations, and the dataset is released for community use. The paper also discusses related efforts in image editing, including mask-based and mask-free editing methods, and provides a detailed methodology for creating the PIPE dataset and training the model. Experiments demonstrate the model's superior performance in object addition tasks, both quantitatively and through human evaluation.