7 Apr 2024 | Jianhong Bai*1, Tianyu He†2, Yuchi Wang3, Junliang Guo2, Haoji Hu1, Zuozhu Liu1, Jiang Bian2
UniEdit is a tuning-free framework designed for both video motion and appearance editing. It leverages a pre-trained text-to-video generator within an inversion-then-generation framework to achieve these tasks. To address the challenge of preserving source video content while editing motion, UniEdit introduces auxiliary motion-reference and reconstruction branches. The motion-reference branch generates text-guided motion features, which are injected into the main editing path via temporal self-attention layers. The reconstruction branch produces source features for content preservation, which are injected into the main editing path via spatial self-attention layers. This approach ensures that the edited video retains the original content while aligning with the target motion prompt. For appearance editing, UniEdit maintains the spatial structure of the source video by replacing the spatial attention maps in the main editing path with those from the reconstruction branch. Extensive experiments demonstrate that UniEdit outperforms state-of-the-art methods in both motion and appearance editing tasks, showcasing its superior performance in content preservation and temporal consistency. The code for UniEdit will be publicly available.
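The abstract names two attention-level mechanisms: injecting motion-branch features as keys/values in temporal self-attention, and replacing the main path's spatial attention maps with those of the reconstruction branch. The NumPy sketch below illustrates both operations on toy tensors; all names, shapes, and the plain scaled dot-product formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Scaled dot-product attention; returns both the output
    # and the attention map, since UniEdit manipulates each.
    scale = q.shape[-1] ** -0.5
    attn = softmax(q @ k.T * scale)
    return attn @ v, attn

rng = np.random.default_rng(0)
n, d = 4, 8  # toy token count and channel width (illustrative)

# Features from the three paths (names are hypothetical):
q_main, v_main = rng.normal(size=(n, d)), rng.normal(size=(n, d))
k_motion, v_motion = rng.normal(size=(n, d)), rng.normal(size=(n, d))  # motion-reference branch
q_rec, k_rec = rng.normal(size=(n, d)), rng.normal(size=(n, d))        # reconstruction branch

# Motion editing (temporal layers): the main path's queries attend to
# the motion-reference branch's keys/values, injecting motion features.
motion_out, _ = self_attention(q_main, k_motion, v_motion)

# Appearance editing (spatial layers): compute the attention map from the
# reconstruction branch, then apply it to the main path's values, so the
# source video's spatial structure is preserved.
_, rec_attn = self_attention(q_rec, k_rec, v_main)
struct_out = rec_attn @ v_main
```

In a real diffusion U-Net these swaps would happen inside every temporal/spatial self-attention block during denoising; the sketch only isolates the tensor-level operation.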