27 Mar 2024
**ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion**
**Authors:** Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, Yedid Hoshen
**Institutions:** Google Research, The Hebrew University of Jerusalem
**Abstract:**
Diffusion models have revolutionized image editing but often generate images that violate physical laws, particularly regarding the effects of objects on the scene. To address this, the authors propose a practical solution centered on a "counterfactual" dataset: pairs of photographs capturing a scene before and after removing a single object, while minimizing all other changes. By fine-tuning a diffusion model on this dataset, they achieve photorealistic object removal, including the object's effects on the scene such as shadows and reflections. However, applying the same approach to photorealistic object insertion would require an impractically large dataset. To tackle this, they propose *bootstrap supervision*, leveraging the object removal model trained on the small counterfactual dataset to synthetically expand it. Their approach significantly outperforms prior methods in photorealistic object removal and insertion, particularly in modeling the effects of objects on the scene.
**Introduction:**
Photorealistic image editing requires both visual appeal and physical plausibility. While diffusion-based editing models enhance aesthetic quality, they often fail to generate physically realistic images. Object removal and insertion are challenging tasks, and current methods struggle with modeling the effects of objects on the scene. The authors analyze the limitations of self-supervised editing approaches through the lens of counterfactual inference, where a counterfactual statement takes the form "if the object did not exist, this reflection would not occur." Accurate object removal and insertion require understanding what the scene would look like with and without the object.
**Method:**
The authors propose a practical approach that trains a diffusion model on a meticulously curated "counterfactual" dataset. Each sample includes a factual image and a counterfactual image, capturing the scene before and after removing an object. They create this dataset by physically altering the scene and capturing the resulting images. This approach ensures that each example reflects only the scene changes related to the presence of the object.
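To make the supervision concrete, here is a minimal, hypothetical sketch of one fine-tuning step on a counterfactual pair, assuming a standard epsilon-prediction diffusion objective. The paper fine-tunes a large pretrained diffusion model; the `ToyDenoiser` backbone, the conditioning layout (noisy target concatenated with the factual image and object mask), and the linear noise schedule below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of counterfactual fine-tuning for object removal (toy stand-ins,
# not the paper's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Stand-in backbone. Input: 3 (noisy counterfactual) + 3 (factual
    image) + 1 (object mask) = 7 channels; output: predicted noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(7, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x, t):
        # A real backbone would embed the timestep t; this toy net ignores it.
        return self.net(x)

def removal_training_step(model, opt, factual, counterfactual, mask, T=1000):
    """One training step: conditioned on the factual image and object mask,
    the model denoises toward the counterfactual (object-free) photo, so it
    must also learn to erase the object's shadows and reflections."""
    b = factual.size(0)
    t = torch.randint(0, T, (b,), device=factual.device)
    # Illustrative linear beta schedule -> cumulative alphas.
    betas = torch.linspace(1e-4, 0.02, T, device=factual.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(counterfactual)
    noisy = alpha_bar.sqrt() * counterfactual + (1.0 - alpha_bar).sqrt() * noise
    pred_noise = model(torch.cat([noisy, factual, mask], dim=1), t)
    loss = F.mse_loss(pred_noise, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with dummy tensors standing in for one counterfactual pair:
model = ToyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
factual = torch.rand(2, 3, 64, 64)         # scene with the object
counterfactual = torch.rand(2, 3, 64, 64)  # same scene, object removed
mask = torch.rand(2, 1, 64, 64).round()    # binary object mask
loss = removal_training_step(model, opt, factual, counterfactual, mask)
```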
For object insertion, which requires synthesizing shadows and reflections, the authors propose a two-step approach. First, they train an object removal model using a smaller counterfactual dataset. Second, they apply the removal model on a large unlabeled image dataset to create a vast synthetic dataset. They fine-tune a diffusion model on this synthetic dataset to add realistic shadows and reflections around newly inserted objects. This approach, called *bootstrap supervision*, significantly improves the quality of object insertion.
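The bootstrap pipeline itself is simple to express. Below is a hypothetical sketch in which `segment_objects` and `remove_object` are placeholder callables (an off-the-shelf segmenter and the trained removal model); the names and the output layout are assumptions for illustration. The key point is that compositing the segmented object back onto the model-cleaned background yields an image that lacks the object's shadows and reflections, while the original photo serves as the ground-truth target.

```python
from typing import Callable, Iterable

import torch

def build_insertion_dataset(
    images: Iterable[torch.Tensor],  # real photos, shape (3, H, W), values in [0, 1]
    segment_objects: Callable[[torch.Tensor], list],  # returns binary (1, H, W) masks
    remove_object: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
) -> list:
    """Synthesize (input, target) pairs for insertion training.

    The removal model erases the object *and* its effects, so naively
    pasting the object back produces an image missing exactly the
    shadows/reflections the insertion model must learn to generate.
    """
    pairs = []
    for img in images:
        for mask in segment_objects(img):
            clean_bg = remove_object(img, mask)  # synthetic counterfactual
            naive_paste = mask * img + (1.0 - mask) * clean_bg  # object without effects
            pairs.append({"input": naive_paste, "mask": mask, "target": img})
    return pairs
```

Fine-tuning on these pairs trains the insertion model to map the effect-free composite back to the real photograph, that is, to synthesize plausible shadows and reflections around the inserted object.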
**Contributions:**
1. An analysis of the limitations of self-supervised training for editing the effects of objects on scenes.
2. An effective counterfactual supervised training method for photorealistic object removal.
3. A bootstrap supervision approach to mitigate the labeling burden for object insertion.
**Experiments:**
The authors evaluate their method on object removal and insertion tasks, where it significantly outperforms prior methods, particularly in modeling the effects of objects on the scene, such as shadows and reflections.