**FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation**
**Authors:** Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy
**Institution:** Wangxuan Institute of Computer Technology, Peking University; S-Lab, Nanyang Technological University
**Abstract:**
This paper introduces FRESCO, a zero-shot video translation framework that strengthens the spatial-temporal correspondence of pre-trained image diffusion models. FRESCO addresses the limitations of existing methods by incorporating both intra-frame (spatial) and inter-frame (temporal) correspondences, so that semantically similar content is transformed consistently across frames. The framework uses explicit feature updates to achieve high spatial-temporal consistency with the input video, markedly improving visual coherence. Extensive experiments demonstrate that FRESCO produces high-quality, coherent videos, outperforming existing zero-shot methods.
**Introduction:**
The paper highlights the challenges of video manipulation, particularly maintaining temporal consistency and natural motion. Existing zero-shot methods focus on refining attention mechanisms, yet they often suffer from inconsistency, undercoverage, and inaccuracy. FRESCO instead combines spatial and temporal correspondences, ensuring that semantically similar content is manipulated cohesively; this enhances temporal consistency while maintaining high controllability.
**Methodology:**
FRESCO is integrated into an inversion-free image translation pipeline built on Stable Diffusion, adapting it for video translation. The framework targets the input features and attention modules of the decoder layers within the U-Net. It introduces FRESCO-guided feature optimization and attention adaptation to achieve high spatial and temporal coherence: the feature optimization combines spatial and temporal consistency losses, while the attention adaptation comprises spatial-guided attention, efficient cross-frame attention, and temporal-guided attention.
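To make the temporal consistency loss concrete, the sketch below shows one common way such a loss is formulated: features of frame t are backward-warped to frame t+1 with a precomputed optical-flow field, and the squared difference is penalized on co-visible pixels. This is a generic, hedged illustration of the idea under assumed names (`warp`, `temporal_consistency_loss`, the occlusion `mask`), not FRESCO's exact formulation.

```python
import torch
import torch.nn.functional as F


def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a feature map (N, C, H, W) by a flow field (N, 2, H, W)."""
    n, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = grid + flow
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(feat, grid_norm, align_corners=True)


def temporal_consistency_loss(feat_t, feat_t1, flow, mask):
    """Masked L2 distance between frame t+1 features and warped frame t features.

    `mask` marks co-visible (non-occluded) pixels, so the loss only constrains
    regions where the flow correspondence is valid.
    """
    warped = warp(feat_t, flow)
    return ((warped - feat_t1) ** 2 * mask).sum() / mask.sum().clamp(min=1)
```

A sanity check: with zero flow and identical features on consecutive frames, the loss should vanish, since the warp reduces to the identity.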
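Cross-frame attention, one of the adapted modules, can be sketched as follows: every frame's queries attend to the keys and values of a shared anchor frame, which ties the appearance of all frames to a common reference. The function name, the choice of a single anchor, and the (T, L, D) tensor layout are illustrative assumptions, not FRESCO's actual implementation.

```python
import torch


def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          anchor: int = 0) -> torch.Tensor:
    """q, k, v: (T, L, D) per-frame token sequences for T frames.

    Instead of each frame attending to its own keys/values (plain
    self-attention), all frames attend to the anchor frame's keys/values,
    which encourages a consistent appearance across the clip.
    """
    t, l, d = q.shape
    k_a = k[anchor].expand(t, l, d)  # share anchor keys with every frame
    v_a = v[anchor].expand(t, l, d)  # share anchor values with every frame
    attn = torch.softmax(q @ k_a.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_a
```

For the anchor frame itself this reduces to ordinary self-attention, which is a convenient correctness check.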
**Experiments:**
The paper compares FRESCO with state-of-the-art zero-shot video translation methods, showing superior performance in editing accuracy and temporal consistency. Ablation studies validate the contributions of different modules, demonstrating that the combination of all adaptations yields the best results. The framework also supports long video translation and video colorization, showcasing its versatility.
**Conclusion:**
FRESCO presents a robust zero-shot video translation framework that enhances spatial-temporal correspondence, leading to high-quality and coherent videos. The method's compatibility with existing image diffusion techniques suggests its potential in various text-guided video editing tasks.