WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting

2024 | Zhiliang Wu, Changchang Sun, Hanyu Xuan, Gaowen Liu, Yan Yan
Video inpainting aims to fill missing regions in video frames with plausible content. Transformer-based models have achieved significant performance improvements thanks to their long-range modeling capacity. However, attention retrieval accuracy remains a bottleneck, and it is affected by factors such as noise in the embeddings. The paper demonstrates theoretically that noise degrades attention calculation: relevant content is retrieved inaccurately and irrelevant content is introduced.

To address this issue, the authors propose WaveFormer, a wavelet transformer network designed to be robust to noise. Unlike existing transformer-based methods, which use the entire embedding for attention calculation, WaveFormer applies the Discrete Wavelet Transform (DWT) to decompose the embedding into low-frequency and high-frequency components, so that the noise is separated into the high-frequency part. The cleaner low-frequency components are used to calculate attention, and the calculated attention is then shared to generate the missing content across the different frequencies. This approach significantly mitigates the impact of noise on attention computation.

Experiments on two benchmark datasets, YouTube-VOS and DAVIS, show that WaveFormer outperforms state-of-the-art methods on PSNR, SSIM, LPIPS, and flow warping error. The method is also robust to different types of masks and noise, generating visually plausible, spatially and temporally coherent content with fine-grained details.

The contributions of the paper include:

1. A theoretical demonstration that noise negatively affects attention calculation in video inpainting.
2. The introduction of WaveFormer, a wavelet transformer network that mitigates the impact of noise on attention computation.
3. Quantitative and qualitative validation of WaveFormer's superior performance over state-of-the-art methods.
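
The idea of computing attention on the low-frequency band and reusing it for every band can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: it uses a single-level 1D Haar wavelet along the embedding axis (the paper's choice of wavelet, number of levels, and the axes it transforms are not given in this summary), and the function names `haar_dwt`, `haar_idwt`, and `low_frequency_attention` are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """Single-level orthonormal Haar DWT along the last axis (even length)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)   # pairwise averages: low-frequency band
    high = (even - odd) / np.sqrt(2.0)  # pairwise differences: high-frequency band
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuild and interleave the even/odd samples."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    out = np.empty(low.shape[:-1] + (2 * low.shape[-1],), dtype=low.dtype)
    out[..., 0::2] = even
    out[..., 1::2] = odd
    return out

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def low_frequency_attention(queries, keys, values):
    """Attention weights come from the low-frequency bands only (assumed toy
    setup); the same weights are then applied to both bands of the values."""
    q_low, _ = haar_dwt(queries)
    k_low, _ = haar_dwt(keys)
    v_low, v_high = haar_dwt(values)
    weights = softmax(q_low @ k_low.T / np.sqrt(q_low.shape[-1]))
    out_low = weights @ v_low     # low-frequency content
    out_high = weights @ v_high   # high-frequency content, same attention
    return haar_idwt(out_low, out_high), weights

# Usage: 6 tokens with 8-dimensional embeddings (random placeholder data).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))
out, weights = low_frequency_attention(tokens, tokens, tokens)
print(out.shape, weights.shape)  # (6, 8) (6, 6)
```

In this sketch the high-frequency band never enters the query-key scores, so whatever noise it carries cannot change which tokens are retrieved; it only passes through the shared weights when the output is reconstructed.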