28 Jul 2024 | Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi
**Abstract:**
We introduce DragAPart, a method that, given an image and a set of drags as input, generates a new image of the same object that responds to the action of the drags. Unlike prior works that focus on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. We start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.
**Introduction:**
We consider the problem of learning an interactive image generator that allows moving the parts of an object by dragging them. For example, dragging on the door of a cabinet should result in the image of the same cabinet but with the door open. Besides applications to controlled image generation, we explore dragging as a way of learning and probing generalist models of motion. Prior approaches to modelling deformable objects often rely on ad-hoc models specific to each object type. In contrast, foundation models like CLIP, GPT-4, DALL-E, and Stable Diffusion take a generalist approach. We hypothesize that a model of motion does not require a template; it is enough that the model understands the possible physical configurations of an object and their transitions. Dragging provides a way to probe such a model without using a template.
**Method:**
We develop DragAPart, an interactive generative model that, given a single object-centric RGB image and one or more drags, synthesizes a second image that reflects the effect of the drags. Key to our model is to fine-tune the motion generator, which models the conditional distribution of the output image given the input image and the drags, on a synthetic dataset of (input image, drags, output image) triplets. This dataset is built from an existing 3D synthetic dataset with rich part-level annotations. We propose a new encoding for the drags, which enables more efficient information propagation in the model. We also propose a domain randomization strategy to improve the model's performance on real-world images.
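The paper's exact drag encoding is not reproduced here, so the following is only a minimal sketch of the general idea, assuming drags are given as source-target point pairs in normalized image coordinates: each drag is rasterized into its own group of dense channels at a chosen feature resolution, so the result can be concatenated with the generator's latent features. The function name `encode_drags` and all parameters are hypothetical.

```python
# Hypothetical sketch (not the paper's exact encoding): rasterize a set of
# drags into dense conditioning maps that can be concatenated with the latent
# features of a diffusion U-Net at a given resolution.
import torch

def encode_drags(drags, height, width, max_drags=5):
    """drags: list of ((x0, y0), (x1, y1)) tuples in normalized [0, 1] coords."""
    # 4 channels per drag slot: (x0, y0) written at the source pixel,
    # (x1, y1) written at the target pixel; unused slots stay at zero.
    enc = torch.zeros(max_drags * 4, height, width)
    for i, ((x0, y0), (x1, y1)) in enumerate(drags[:max_drags]):
        r0, c0 = int(y0 * (height - 1)), int(x0 * (width - 1))
        r1, c1 = int(y1 * (height - 1)), int(x1 * (width - 1))
        enc[4 * i + 0, r0, c0] = x0
        enc[4 * i + 1, r0, c0] = y0
        enc[4 * i + 2, r1, c1] = x1
        enc[4 * i + 3, r1, c1] = y1
    return enc

# Example: one drag pulling a drawer handle to the right, encoded at the
# resolution of a 32x32 latent grid.
maps = encode_drags([((0.45, 0.60), (0.70, 0.60))], height=32, width=32)
print(maps.shape)  # torch.Size([20, 32, 32])
```

Writing the drags into spatial maps, rather than a flat vector, keeps the conditioning signal spatially aligned with the image content as it propagates through the network.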
**Experiments:**
We show that our method outperforms prior works both quantitatively and qualitatively. Ablation studies validate our design choices. We also demonstrate some downstream applications of DragAPart, including segmenting moving parts and analyzing motion for articulated objects.
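As a purely illustrative example of the moving-part segmentation application (not the procedure used in the paper), one could compare the input image with the image generated for a given drag and threshold the per-pixel change; the helper `moving_part_mask` below is hypothetical.

```python
# Hypothetical illustration: derive a coarse moving-part mask by differencing
# the input image and the image DragAPart generates for a drag. Pixels that
# change most are assumed to belong to the part that moved.
import numpy as np

def moving_part_mask(image_before, image_after, threshold=0.1):
    """Both images are float arrays in [0, 1] with shape (H, W, 3)."""
    diff = np.abs(image_after - image_before).mean(axis=-1)  # per-pixel change
    return diff > threshold  # boolean (H, W) mask of the moving region

# Example with stand-in images; in practice image_after would come from the
# generator conditioned on the drag.
before = np.random.rand(256, 256, 3)
after = before.copy()
after[100:150, 60:200] = np.random.rand(50, 140, 3)  # simulate a moved drawer
mask = moving_part_mask(before, after)
print(mask.shape, mask.mean())
```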
**Conclusion:**
We presented DragAPart, an image generator that uses drags as an interface for part-level dynamics. By using a small amount of synthetic data and domain randomization, DragAPart can respond to drags by interpreting them as fine-grained part-level interactions with the underlying object. Partly thanks to a new drag encoder, we have obtained better results than other methods on this task.