FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

19 Mar 2024 | Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy
This paper introduces FRESCO, a zero-shot video translation framework that improves spatial and temporal consistency in video editing. The framework uses pre-trained image diffusion models to translate videos according to text prompts while preserving semantic content and motion. FRESCO combines intra-frame and inter-frame correspondences into a spatial-temporal constraint, so that semantically similar content is transformed consistently across frames. The method explicitly updates features to match the high spatial-temporal consistency of the input video, which improves the visual coherence of the translated videos. The framework is compatible with assistive techniques such as ControlNet, SDEdit, and LoRA, enabling more flexible and customized video translation.
FRESCO introduces two levels of enhancement: attention and feature. At the attention level, FRESCO-guided attention builds on optical flow guidance and enriches the attention mechanism with the self-similarity of the input frame. At the feature level, FRESCO-aware feature optimization explicitly updates semantically meaningful features in the U-Net decoder layers by gradient descent, so that they align closely with the high spatial-temporal consistency of the input video.
The method addresses three critical issues: inconsistency, undercoverage, and inaccuracy. By incorporating intra-frame spatial correspondence, FRESCO ensures that semantically similar content is manipulated cohesively and remains similar after translation. This strategy addresses the first two issues: it prevents the foreground from being erroneously translated into the background, and it enhances the consistency of the optical flow. For regions where optical flow is not available, the spatial correspondence within the original frame can serve as a regulatory mechanism.
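The intra-frame idea described above, that content which is similar in the input frame should stay similar after translation, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the mean-squared comparison of the two similarity matrices are all assumptions chosen for illustration, and features are plain Python lists rather than U-Net decoder tensors.

```python
import math


def normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def self_similarity(features):
    """Cosine similarity between every pair of per-location feature vectors."""
    unit = [normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def spatial_consistency_loss(input_features, translated_features):
    """Mean squared difference between the two self-similarity matrices.

    Hypothetical loss: it is zero when the translated frame keeps the
    same pairwise similarity structure as the input frame.
    """
    sim_in = self_similarity(input_features)
    sim_out = self_similarity(translated_features)
    n = len(sim_in)
    total = sum(
        (sim_in[i][j] - sim_out[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy frame with three locations: 0 and 1 look alike, 2 is different.
input_feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
# Translation that changes appearance but keeps 0 and 1 alike.
consistent = [[0.0, 3.0], [0.0, 1.0], [1.0, 0.0]]
# Translation that wrongly groups location 1 with location 2.
inconsistent = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(spatial_consistency_loss(input_feats, consistent))    # zero loss
print(spatial_consistency_loss(input_feats, inconsistent))  # positive loss
```

In a feature-level optimization of the kind the summary describes, a quantity like this would be one term minimized by gradient descent over the features; the sketch only evaluates it and leaves the optimization itself out.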
For long video translation, the framework uses a heuristic approach to select keyframes and interpolates the remaining non-key frames.
The main contributions are: a novel zero-shot diffusion framework guided by frame spatial-temporal correspondence for coherent and flexible video translation; the combination of FRESCO-guided feature attention and optimization as a robust intra- and inter-frame constraint, with better consistency and coverage than optical flow alone; and long video translation by jointly processing batched frames with inter-batch consistency.
The proposed method is evaluated on various benchmarks and shows significant improvements in video translation quality over existing zero-shot methods. Its compatibility with existing image diffusion techniques suggests potential application to other text-guided video editing tasks, such as video super-resolution and colorization.
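The summary does not spell out the keyframe heuristic used for long videos, so the following is only a sketch of one plausible rule, not the method from the paper. It starts a new keyframe when the frame has changed enough since the last keyframe or when too many frames have passed. The names `frame_distance` and `select_keyframes`, the parameters `threshold`, `min_interval`, and `max_interval`, and the choice to always include the last frame are all assumptions made for illustration.

```python
def frame_distance(frame_a, frame_b):
    """Mean absolute difference between two frames given as flat lists of pixel values."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold, min_interval=1, max_interval=10):
    """Return indices of keyframes chosen by a simple change-based heuristic.

    Hypothetical rule: frame i becomes a keyframe if it is at least
    `min_interval` frames after the previous keyframe and either differs
    from that keyframe by more than `threshold` or is `max_interval`
    frames away from it. The first and last frames are always kept so
    that every other frame lies between two keyframes.
    """
    if not frames:
        return []
    keyframes = [0]
    for i in range(1, len(frames)):
        gap = i - keyframes[-1]
        if gap < min_interval:
            continue
        changed = frame_distance(frames[keyframes[-1]], frames[i]) > threshold
        if changed or gap >= max_interval:
            keyframes.append(i)
    if keyframes[-1] != len(frames) - 1:
        keyframes.append(len(frames) - 1)
    return keyframes


# Toy video of six one-pixel frames with a scene change at index 3.
video = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
print(select_keyframes(video, threshold=1.0))  # [0, 3, 5]
```

Frames between two selected indices would then be filled in by interpolation from the translated keyframes, which this sketch does not attempt.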