2 Aug 2022 | Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or
This paper introduces Prompt-to-Prompt, an image-editing method that lets users edit images through text prompts alone, with no masks or other spatial annotations. The key observation is that the cross-attention layers of text-conditioned diffusion models govern how each word of the prompt influences each region of the generated image.

By modifying these cross-attention maps during the diffusion process, the method supports both localized and global edits: swapping a word in the prompt, refining the prompt with added descriptions, or re-weighting the influence of a specific word. Because the attention maps from the original generation are injected into the edited generation, the result keeps the structure and content of the original image while adapting to the new prompt.

The method is demonstrated on a variety of images and prompts, showing high-quality synthesis and fidelity to the edited prompts. The paper also extends the approach to real images: an inversion process recovers the initial noise vector for a given image, after which the same text-based edits apply while key elements of the image are preserved. The authors conclude that the method offers a simple, intuitive way to edit images with text, leveraging the semantic power of language to control the generation process. The three sketches below illustrate the main mechanisms.
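To make the attention-injection idea concrete, here is a minimal PyTorch sketch of the word-swap edit. This is not the authors' released code: the `unet_step` callable (one denoising step that also returns its cross-attention maps) and the `tau` cutoff are assumptions for illustration. The essential move, which the paper does describe, is caching the attention maps from the source prompt's pass and reusing them while denoising the edited prompt for an initial fraction of the steps.

```python
import torch
import torch.nn.functional as F

def cross_attention(q, k, v, injected_probs=None):
    # Standard scaled dot-product cross-attention between image queries (q)
    # and text-token keys/values (k, v).
    scale = q.shape[-1] ** -0.5
    probs = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    if injected_probs is not None:
        # Word swap: overwrite the maps with those cached from the source
        # prompt so the spatial layout of the original image is preserved.
        probs = injected_probs
    return probs @ v, probs

def denoise_with_injection(unet_step, z_T, src_embeds, edit_embeds,
                           num_steps=50, tau=0.8):
    # Hypothetical driver: denoise the source and edited prompts side by
    # side from the same seed, injecting the source attention maps for the
    # first tau * num_steps steps. `unet_step(z, embeds, t, injected)` is
    # an assumed interface returning (next_latent, attention_maps).
    z_src, z_edit = z_T.clone(), z_T.clone()
    for t in range(num_steps):
        z_src, src_maps = unet_step(z_src, src_embeds, t, injected=None)
        inject = src_maps if t < tau * num_steps else None
        z_edit, _ = unet_step(z_edit, edit_embeds, t, injected=inject)
    return z_edit
```

Starting both passes from the same noise `z_T` is what makes the comparison meaningful: any difference between the two images then comes only from the prompt change and the attention control.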
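Attention re-weighting, the third edit type, is even simpler to sketch: scaling one token's column of the attention maps strengthens or weakens that word's effect on the image. The tensor layout below is an assumption; the paper scales the chosen token's map by a user-set factor without renormalizing the remaining tokens.

```python
def reweight_token(attn_probs, token_index, scale):
    # attn_probs: (batch, heads, pixels, tokens) cross-attention maps.
    # scale > 1 amplifies the word's influence (e.g. "fluffier"),
    # scale < 1 attenuates it; other tokens are left untouched.
    attn_probs = attn_probs.clone()
    attn_probs[..., token_index] *= scale
    return attn_probs
```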
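For real image editing, the summary mentions an inversion process that recovers the initial noise vector. A common realization of this is deterministic DDIM inversion, sketched below under assumed interfaces: `eps_model(z, t, embeds)` is a hypothetical noise predictor and `alphas_cumprod` the diffusion schedule. This is an illustration of the standard technique, not necessarily the paper's exact procedure.

```python
import torch

@torch.no_grad()
def ddim_invert(z0, eps_model, text_embeds, alphas_cumprod, num_steps=50):
    # Walk the deterministic DDIM update backwards, from the clean latent
    # z0 to a noise latent z_T that regenerates the image under the same
    # prompt, so that attention-based edits can then be applied.
    z = z0
    steps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    for t_prev, t in zip(steps[:-1], steps[1:]):
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(z, t_prev, text_embeds)  # predicted noise at t_prev
        # Predict x0 from the current latent, then step to the noisier t.
        x0 = (z - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        z = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps
    return z
```

Denoising the returned `z_T` with the original prompt should approximately reproduce the input image; editing the prompt and injecting attention maps, as above, then yields the edited version.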