This paper introduces a novel method for text-driven editing of natural images, combining a pretrained language-image model (CLIP) with a denoising diffusion probabilistic model (DDPM) to achieve realistic, region-based edits. The method uses a natural language description and an ROI mask to guide edits, ensuring that the edited region aligns with the text prompt while preserving the rest of the image. A key innovation is the use of spatial blending between the input image and the text-guided diffusion latent at various noise levels, which ensures seamless integration of the edited region. Additionally, the method incorporates augmentations during the diffusion process to reduce adversarial results. The approach outperforms existing methods in terms of realism, background preservation, and text alignment. The method is demonstrated through various applications, including adding/removing objects, background replacement, and image extrapolation. The paper also discusses limitations, such as inference time and potential biases inherited from CLIP, and suggests future research directions, including extending the method to other modalities like 3D objects and videos. The work highlights the potential of text-driven image editing as an intuitive and powerful tool for content creators.
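The spatial-blending idea described above can be sketched as follows: at each reverse-diffusion step, the text-guided latent supplies the masked ROI, while the region outside the mask is replaced by the input image forward-diffused to the matching noise level, so the two agree in noise statistics and blend seamlessly. This is a minimal NumPy sketch under stated assumptions; `denoise_step`, the toy alpha schedule, and all function names here are illustrative, not the paper's actual code.

```python
import numpy as np

def noise_to_level(x0, t, alphas_cumprod, rng):
    # Forward-diffuse a clean image x0 to noise level t,
    # following the standard DDPM q(x_t | x_0) formula.
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

def blended_step(x_t, t, x0_bg, mask, denoise_step, alphas_cumprod, rng):
    # One reverse step with spatial blending (illustrative sketch):
    # - x_fg: the text-guided proposal for the whole frame at level t-1
    #   (denoise_step stands in for the CLIP-guided DDPM update)
    # - x_bg: the original image noised to the same level t-1
    # The binary mask keeps the edit inside the ROI and the
    # noise-matched original everywhere else.
    x_fg = denoise_step(x_t, t)
    x_bg = noise_to_level(x0_bg, t - 1, alphas_cumprod, rng)
    return mask * x_fg + (1.0 - mask) * x_bg

# Toy usage: with an all-ones mask, the output is just the guided proposal.
rng = np.random.default_rng(0)
alphas_cumprod = np.linspace(0.99, 0.01, 10)   # toy noise schedule
x0 = np.ones((4, 4))
x_t = rng.standard_normal((4, 4))
halve = lambda x, t: 0.5 * x                   # stand-in denoiser
out = blended_step(x_t, 5, x0, np.ones((4, 4)), halve, alphas_cumprod, rng)
```

Blending at every noise level, rather than only compositing once at the end, is what lets the generated content and the preserved background stay mutually consistent throughout the denoising trajectory.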