15 Mar 2024 | Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou†, Yan Li, Tingting Gao, and Di Zhang
**Introduction:**
The paper introduces DragAnything, a method that uses entity representation to achieve motion control for any object in controllable video generation. Compared to existing methods, DragAnything offers several advantages: it is user-friendly (interaction requires only drawing trajectories), it handles open-domain entities including backgrounds, and it supports simultaneous and distinct motion control for multiple objects.
**Entity Representation:**
The entity representation is a novel approach that extracts latent features from the diffusion model to represent each object. This representation enables precise motion control even though trajectory points alone cannot identify the intended entity: dragging a single pixel is ambiguous between moving that pixel and moving the object it belongs to.
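As a rough illustration of the idea, the sketch below pools inverted diffusion features over an entity mask to form one embedding per entity. It assumes the latent feature map of the first frame has already been obtained via diffusion inversion; the function name and the average-pooling choice are illustrative, not the authors' code.

```python
import torch

def extract_entity_embedding(latent: torch.Tensor,
                             entity_mask: torch.Tensor) -> torch.Tensor:
    """Pool diffusion latent features over an entity mask.

    latent:      (C, H, W) feature map obtained by diffusion inversion
                 of the first frame.
    entity_mask: (H, W) boolean mask of the entity in the first frame.
    Returns a (C,) embedding that represents the entity.
    """
    rows, cols = entity_mask.nonzero(as_tuple=True)  # pixel coordinates inside the mask
    feats = latent[:, rows, cols]                    # (C, N) features at those pixels
    return feats.mean(dim=1)                         # average-pool into one entity vector
```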
**Methodology:**
1. **Task Formulation and Motivation:** The task is trajectory-based video generation: the model synthesizes a video that follows given motion trajectories. The guidance signal includes the trajectory points, the first frame of the video, and the entity mask of the first frame.
2. **Architecture:** The architecture consists of a denoising diffusion model (3D U-Net), an encoder, and a decoder. The entity representation and 2D Gaussian representation are extracted and combined to achieve entity-level controllable generation.
3. **Entity Semantic Representation Extraction:** The entity representation is obtained by applying diffusion inversion to the first frame and indexing the resulting latent features at the coordinates given by the entity mask. A complementary 2D Gaussian representation weights the central region of the entity more heavily (see the heatmap sketch after this list).
4. **Training and Inference:** Ground-truth guidance signals are generated from video segmentation datasets, and the model is trained with a denoising objective conditioned on the entity and 2D Gaussian representations (a sketch of this objective also follows the list).
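For item 3, a minimal sketch of a center-weighted 2D Gaussian heatmap; `sigma` is an assumed free parameter controlling how quickly the weight decays away from the entity center:

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, center: tuple, sigma: float) -> np.ndarray:
    """2D Gaussian over an h x w grid, peaking at the entity center (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

Weighting the center more heavily reflects the intuition that pixels near an entity's center are more likely to belong to it than pixels near its boundary.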
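For item 4, a hedged sketch of a standard epsilon-prediction denoising objective. The `cond` argument packing the guidance signals is an assumed interface, and the paper's exact loss may differ in its weighting:

```python
import torch
import torch.nn.functional as F

def denoising_loss(unet, latents, cond, alphas_cumprod):
    """Epsilon-prediction objective for video latents of shape (B, C, T, H, W).

    `cond` packs the guidance (first frame, entity representation, 2D
    Gaussian maps); this interface is assumed, not the authors' code.
    """
    b = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=latents.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)         # broadcast over (C, T, H, W)
    noisy = a.sqrt() * latents + (1.0 - a).sqrt() * noise
    return F.mse_loss(unet(noisy, t, cond), noise)    # 3D U-Net predicts the noise
```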
**Experiments:**
1. **Evaluation Metrics:** The method is evaluated using FID and FVD for visual quality, ObjMC for object motion control (the distance between generated and target trajectories; see the sketch after this list), and a user study.
2. **Comparisons with State-of-the-Art Methods:** DragAnything outperforms existing methods in terms of video quality, temporal coherence, and object motion control.
3. **Ablation Studies:** The effectiveness of the entity representation and 2D Gaussian representation is demonstrated through ablation studies.
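One plausible reading of the ObjMC metric, sketched below as the mean Euclidean distance between generated and target trajectory points; the paper's exact protocol (e.g., how trajectories are tracked in the generated video) may differ:

```python
import numpy as np

def objmc(pred_traj: np.ndarray, gt_traj: np.ndarray) -> float:
    """Mean Euclidean distance between generated and target trajectories.

    Both arrays have shape (num_frames, num_points, 2) in pixel
    coordinates; lower is better.
    """
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())
```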
**Conclusion:**
DragAnything achieves state-of-the-art performance in controllable video generation, particularly in object motion control, surpassing previous methods by 26% in human voting. The method is flexible and user-friendly, supporting diverse motion control for any entity in the video.