AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

11/2024 | Max Ku*, Cong Wei*, Weiming Ren*, Harry Yang, Wenhu Chen
AnyV2V is a tuning-free framework for video-to-video editing. It decomposes video editing into two steps: (1) editing the first frame with an off-the-shelf image editing model, and (2) generating the edited video with an existing image-to-video (I2V) generation model using temporal feature injection. By delegating the edit itself to existing image editing tools, AnyV2V supports a wide range of tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation. It requires no fine-tuning and no additional video features, is compatible with a broad set of image editing models, and can handle videos of arbitrary length.

Evaluated across multiple video editing tasks, AnyV2V performs on par with baseline methods on CLIP scores, while human evaluations show it significantly outperforms existing methods in visual consistency and quality: it is favored in 69.7% of samples for prompt alignment and receives 46.2% overall preference. Because it can harness any compatible image editing model, AnyV2V also performs editing tasks beyond the scope of current publicly available methods.

The key contributions are: proposing AnyV2V as a fundamentally different, tuning-free solution for video editing; demonstrating long-video editing by inverting videos beyond the training frame lengths of I2V models; and showing superior quantitative and qualitative performance over existing state-of-the-art methods. A minimal sketch of the two-step pipeline follows.
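To make the two-stage design concrete, here is a minimal Python sketch of the pipeline. The function names (`edit_first_frame`, `ddim_invert`, `i2v_sample_with_injection`) are hypothetical placeholders, not part of any released AnyV2V API: the first stands in for any off-the-shelf image editor, the other two for an I2V diffusion model's inversion and feature-injected sampling passes.

```python
from typing import List
import numpy as np

# --- Hypothetical placeholders (assumptions, not the AnyV2V API) ---

def edit_first_frame(frame: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for any off-the-shelf image editing model
    (prompt-based, style-transfer, subject-driven, or identity editor)."""
    return frame  # identity stub; a real editor returns the modified frame

def ddim_invert(frames: List[np.ndarray]) -> List[np.ndarray]:
    """Stand-in for DDIM inversion of the source video through the I2V
    model, producing initial latents that preserve the source motion."""
    return [f.astype(np.float32) for f in frames]  # stub latents

def i2v_sample_with_injection(first_frame: np.ndarray,
                              inverted_latents: List[np.ndarray]) -> List[np.ndarray]:
    """Stand-in for I2V denoising conditioned on the edited first frame,
    with temporal features from the source video injected at each step so
    the edited video inherits the original motion."""
    return [first_frame for _ in inverted_latents]  # stub output frames

# --- The tuning-free two-step pipeline described above ---

def anyv2v_edit(source_frames: List[np.ndarray], instruction: str) -> List[np.ndarray]:
    # Step 1: edit only the first frame with an image editing model.
    edited_first = edit_first_frame(source_frames[0], instruction)
    # Step 2: invert the source video, then regenerate it from the edited
    # first frame while injecting source features for motion consistency.
    latents = ddim_invert(source_frames)
    return i2v_sample_with_injection(edited_first, latents)

if __name__ == "__main__":
    video = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(16)]
    edited = anyv2v_edit(video, "turn the car into a red sports car")
    print(f"edited video: {len(edited)} frames")
```

The stubs only mark where the real model calls would go; the point of the sketch is the control flow: because all editing happens in the single first frame, any image editing model can be swapped into step 1 without retraining anything.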