Zero-shot Image Editing with Reference Imitation

11 Jun 2024 | Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen
This paper introduces a novel form of image editing, called imitative editing, in which the user masks the region of a source image to be edited and supplies a reference image, without needing to indicate which part of the reference should be imitated. The proposed method, MimicBrush, leverages a self-supervised training framework built on diffusion models to automatically locate and imitate the corresponding part of the reference image.

MimicBrush uses two diffusion U-Nets: an imitative U-Net that processes the masked source image and a reference U-Net that processes the reference image. The reference U-Net's attention keys and values are injected into the imitative U-Net, so that completing the masked region can draw directly on reference features (a toy sketch of this mechanism is shown below).

Training is self-supervised on video data: two frames are randomly selected as the source and reference images, the source frame is masked, and the model learns to recover the masked regions using information from the reference frame, exploiting the fact that the two frames depict the same content under varying poses, lighting, and viewpoints.

The model is evaluated on a benchmark covering two tasks, part composition and texture transfer, and outperforms existing methods in both fidelity and harmony. It also supports a wide range of applications, including product design, character creation, and special effects.

The paper also discusses limitations: the model may fail to locate the reference region when it is very small or when the reference image contains multiple plausible candidates. In such cases, the authors suggest cropping the reference image to zoom in on the desired region. The authors expect the method to bring new inspiration for the community to explore more advanced techniques for image generation and editing.
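To make the key/value injection concrete, here is a minimal PyTorch sketch of the mechanism summarized above. It is an illustrative toy, not the paper's implementation: the real MimicBrush uses two full Stable Diffusion U-Nets, whereas `TinyUNet`, `SelfAttention`, the 32x32 latent grid, and all channel and feature dimensions here are assumptions chosen only to show how reference keys/values can be concatenated into the imitative branch's self-attention, and how a masked pair of video frames forms one self-supervised training example.

```python
# Conceptual sketch of MimicBrush-style key/value injection between two U-Nets.
# Toy single-attention "U-Nets"; names, dimensions, and the noising step are
# illustrative assumptions, not the paper's actual architecture or schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Self-attention that can optionally attend over extra keys/values
    coming from the reference branch (the 'injection' mechanism)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, extra_kv=None):
        b, n, d = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        if extra_kv is not None:
            # Concatenate reference keys/values along the token axis so the
            # imitative branch can attend to reference features directly.
            ref_k, ref_v = extra_kv
            k = torch.cat([k, ref_k], dim=1)
            v = torch.cat([v, ref_v], dim=1)

        def split(t):
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)


class TinyUNet(nn.Module):
    """Stand-in for a diffusion U-Net: one projection + one attention block."""

    def __init__(self, in_ch: int, dim: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(in_ch, dim)
        self.attn = SelfAttention(dim)
        self.proj_out = nn.Linear(dim, 4)  # predicts noise in latent space

    def forward(self, tokens, extra_kv=None, return_kv=False):
        h = self.proj_in(tokens)
        if return_kv:
            # Reference branch: expose its keys/values instead of denoising.
            return self.attn.to_k(h), self.attn.to_v(h)
        h = h + self.attn(h, extra_kv=extra_kv)
        return self.proj_out(h)


# Toy training step on a video frame pair (assumed 32x32 latents, 4 channels).
ref_unet = TinyUNet(in_ch=4)            # processes the reference frame
imit_unet = TinyUNet(in_ch=4 + 4 + 1)   # noisy latent + masked source + mask

src_latent = torch.randn(1, 32 * 32, 4)             # frame A: source to recover
ref_latent = torch.randn(1, 32 * 32, 4)             # frame B: reference
mask = (torch.rand(1, 32 * 32, 1) > 0.75).float()   # 1 = region to complete

noise = torch.randn_like(src_latent)
noisy = src_latent + noise                      # stand-in for DDPM noising
masked_src = src_latent * (1 - mask)            # hide the region to imitate

ref_kv = ref_unet(ref_latent, return_kv=True)   # cache reference keys/values
pred = imit_unet(torch.cat([noisy, masked_src, mask], dim=-1), extra_kv=ref_kv)

loss = F.mse_loss(pred, noise)                  # standard noise-prediction loss
loss.backward()
print("loss:", loss.item())
```

Concatenating the reference keys/values along the token axis lets every query in the imitative branch attend to reference features directly, which is why no explicit correspondence or reference-region annotation is needed during training: the model learns on its own which reference tokens are useful for filling the masked region.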