20 Mar 2023 | Bahjat Kawar*, Shiran Zada*, Huiwen Chang, Tali Dekel, Oran Lang, Inbar Mosseri, Omer Tov, Michal Irani
Imagic is a text-based real image editing method that enables complex, non-rigid semantic edits on a single real image. Given a pre-trained text-to-image diffusion model, the method proceeds in three stages: it first optimizes a text embedding, initialized from the target text's embedding, so that it aligns with both the input image and the target text; it then fine-tunes the diffusion model around this optimized embedding to better reconstruct the input image; finally, it linearly interpolates between the optimized embedding and the target text embedding and feeds the result to the fine-tuned model to generate the edited image (a code sketch of these stages follows at the end of this summary). Imagic can change the posture and composition of objects, alter style and color, and add objects. It requires only a single input image and a target text, without additional inputs such as image masks or multiple views of the object.

The method is implemented on two state-of-the-art text-to-image diffusion models, Imagen and Stable Diffusion, and is evaluated on numerous real images from various domains, producing high-quality, versatile edits that remain faithful to the input image while aligning with the target text; applying different text prompts to the same image further demonstrates its versatility. To assess text-based image editing methods, a benchmark called TEdBench is introduced. On TEdBench, Imagic outperforms previous leading methods in both editing quality and fidelity to the original image, and in a user study human raters strongly prefer it over those methods. Imagic is the first method to apply such sophisticated text-based edits to a single real image while preserving the original image's structure and composition.

The method has limitations: some edits are subtle and do not align well with the target text, while others affect extrinsic image details such as zoom or camera angle. These failure cases can be mitigated by optimizing the text embedding or the diffusion model differently, or by incorporating cross-attention control. Imagic also inherits the generative limitations and biases of the underlying text-to-image diffusion model, which can lead to unwanted artifacts in certain cases. Because it relies mostly on the input image for editing, it is less prone to societal biases than purely generative methods, though it does not eliminate them. Finally, the method could be used by malicious parties to synthesize fake imagery that misleads viewers, motivating further research on identifying synthetically edited or generated content.
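To make the three stages concrete, here is a minimal PyTorch sketch of the pipeline. It is written against a generic conditional denoiser rather than the authors' code: eps_model, sample_fn, the noise schedule alpha_bar, and all hyperparameter values (eta, step counts, learning rates) are illustrative assumptions, not Imagic's actual implementation on Imagen or Stable Diffusion.

```python
# Minimal sketch of the three Imagic stages, assuming a generic conditional
# denoiser eps_model(x_t, t, emb) that predicts the added noise. All names
# and hyperparameter values here are illustrative stand-ins.
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, emb, alpha_bar):
    """Standard denoising loss: predict the noise added at a random timestep."""
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward diffusion q(x_t | x0)
    return F.mse_loss(eps_model(x_t, t, emb), noise)

def imagic_edit(eps_model, x0, target_emb, alpha_bar, sample_fn,
                eta=0.7, emb_steps=100, ft_steps=1500):
    # Stage A: optimize a text embedding, initialized from the target text
    # embedding, so that the *frozen* model reconstructs the input image.
    e_opt = target_emb.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([e_opt], lr=1e-3)  # illustrative learning rate
    for _ in range(emb_steps):
        opt.zero_grad()
        diffusion_loss(eps_model, x0, e_opt, alpha_bar).backward()
        opt.step()
    e_opt = e_opt.detach()

    # Stage B: fine-tune the model weights, conditioned on the now-frozen
    # optimized embedding, to close the remaining reconstruction gap.
    opt = torch.optim.Adam(eps_model.parameters(), lr=1e-5)  # illustrative
    for _ in range(ft_steps):
        opt.zero_grad()
        diffusion_loss(eps_model, x0, e_opt, alpha_bar).backward()
        opt.step()

    # Stage C: linearly interpolate between the optimized and target
    # embeddings, then sample from the fine-tuned model:
    #   e_edit = eta * e_target + (1 - eta) * e_opt
    e_edit = eta * target_emb + (1 - eta) * e_opt
    return sample_fn(eps_model, e_edit)  # any DDPM/DDIM-style sampler
```

The interpolation coefficient eta controls the fidelity/edit trade-off: values near 0 reproduce the input image, larger values follow the target text more closely, and intermediate values typically yield edits that do both.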