3 Jun 2024 | OMRI AVRAHAMI, NVIDIA, The Hebrew University of Jerusalem; RINON GAL, NVIDIA, Tel Aviv University; GAL CHECHIK, NVIDIA; OHAD FRIED, Reichman University; DANI LISCHINSKI, The Hebrew University of Jerusalem; ARASH VAHDAT, NVIDIA; WEILI NIE, NVIDIA
DiffUHaul is a training-free method for object dragging in images that leverages the spatial understanding of a localized text-to-image model. The method addresses the challenge of seamlessly relocating objects within a scene by using attention masking and self-attention sharing to disentangle object representations and preserve high-level object appearance. A novel diffusion anchoring technique is introduced: in the early denoising steps, attention features are interpolated between the source and target images to smoothly fuse the new layout with the original appearance, while in the later steps, fine-grained object details are retained by passing localized features from the source images. To adapt the method to real-image editing, a DDPM self-attention bucketing technique is used to better reconstruct real images with the localized model. An automated evaluation pipeline is introduced, and its results are reinforced through a user preference study. The method demonstrates robustness in object dragging, preserving both foreground and background appearance while achieving high-quality results. The contributions include showing the effectiveness of localized text-to-image models for object dragging, revealing an entanglement problem in gated self-attention layers, introducing a novel soft anchoring mechanism, and demonstrating that DDPM self-attention bucketing suffices for real-image editing. The method is evaluated against several baselines in terms of foreground similarity, object traces, and realism, showing superior performance, and is also preferred by human evaluators in a user study. Limitations include difficulties with object rotation, resizing, and handling colliding objects.
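The soft anchoring schedule described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the linear interpolation weight, and the switch point are all illustrative assumptions; the abstract specifies only that early denoising steps interpolate attention features between source and target, while later steps pass source features through.

```python
def anchor_features(src_feats, tgt_feats, step, num_steps, switch_frac=0.5):
    """Sketch of a soft anchoring schedule (illustrative, not the paper's code).

    Early denoising steps (step < switch_frac * num_steps) blend source and
    target attention features to fuse the new layout with the original
    appearance; later steps pass the localized source features through
    unchanged to retain fine-grained object details.
    Features are plain lists of floats here; a real pipeline would use tensors.
    """
    switch = switch_frac * num_steps
    if step < switch:
        # Assumed linear schedule: weight shifts from source toward target.
        alpha = step / switch
        return [(1 - alpha) * s + alpha * t for s, t in zip(src_feats, tgt_feats)]
    # Late steps: keep source features for fine-grained detail.
    return list(src_feats)
```

With a 40-step schedule and the assumed 50% switch point, step 10 yields an even blend and step 30 returns the source features untouched.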