DragAnything: Motion Control for Anything using Entity Representation


15 Mar 2024 | Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang
DragAnything is a method for motion control in controllable video generation built on entity representation. Compared with existing methods, it offers three advantages: a user-friendly trajectory-based interaction, control of any object (including the background), and simultaneous control of multiple objects. The entity representation is an open-domain embedding capable of representing any object, which enables precise motion control. Extensive experiments show that DragAnything achieves state-of-the-art performance in FVD, FID, and user studies, surpassing previous methods by 26% in human voting for motion control.

DragAnything builds on the Stable Video Diffusion (SVD) architecture, using a denoising diffusion model together with an encoder and decoder to generate videos. The entity representation is extracted from the first frame using the entity mask and diffusion features, and a 2D Gaussian representation is incorporated to strengthen the focus on each entity's central region (see the sketches below). For training, ground-truth labels are generated from video segmentation datasets.

DragAnything outperforms existing methods in video quality, temporal coherence, and object motion control, and supports diverse motion types, including foreground, background, and camera motion. It is user-friendly in practice: users simply select regions and drag points to control motion. Remaining limitations include handling 3D motion and generating large-scale motion.

Project website: DragAnything.

Keywords: Motion Control · Controllable Video Generation · Diffusion Model.
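As a rough illustration of the entity representation, here is a minimal sketch of how an open-domain entity embedding could be pooled from first-frame diffusion features and an entity mask. The function name, tensor shapes, and the mean-pooling operator are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def extract_entity_embedding(diffusion_features: np.ndarray,
                             entity_mask: np.ndarray) -> np.ndarray:
    """Average-pool first-frame diffusion features inside an entity mask.

    diffusion_features: (H, W, C) feature map taken from the denoising U-Net.
    entity_mask:        (H, W) boolean mask of one entity in the first frame.
    Returns a (C,) open-domain embedding for that entity (illustrative sketch).
    """
    assert diffusion_features.shape[:2] == entity_mask.shape
    pixels = diffusion_features[entity_mask]   # (N, C): features at masked locations
    return pixels.mean(axis=0)                 # one pooled embedding per entity
```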
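Similarly, a minimal sketch of the 2D Gaussian representation: a centre-peaked weight map placed at each trajectory point, used here to scatter the entity embedding into per-frame conditioning maps. The sigma value, map layout, and helper names are assumptions; DragAnything's actual injection of this signal into the SVD denoiser is more involved.

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, center: tuple,
                     sigma: float = 8.0) -> np.ndarray:
    """Render an (h, w) map with a 2D Gaussian peaked at center = (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def build_condition_maps(embedding: np.ndarray, trajectory,
                         h: int, w: int, sigma: float = 8.0) -> np.ndarray:
    """Scatter one entity embedding along a per-frame trajectory,
    down-weighted away from the entity centre by the Gaussian."""
    maps = np.zeros((len(trajectory), embedding.shape[0], h, w), dtype=np.float32)
    for t, (cx, cy) in enumerate(trajectory):
        g = gaussian_heatmap(h, w, (cx, cy), sigma)   # centre-focused weights
        maps[t] = embedding[:, None, None] * g[None]  # broadcast (C,) over the grid
    return maps
```

In this sketch, each per-frame map would serve as conditioning input alongside the diffusion model's latent; the paper's guidance mechanism may differ in detail.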