Understanding WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting

WaveFormer is a wavelet transformer designed for noise-robust video inpainting. The paper argues that noise in the input corrupts the attention calculation in transformer-based video inpainting, which leads to inaccurate results. To address this, WaveFormer applies the Discrete Wavelet Transform (DWT) to separate noise into the high-frequency components, so that the clean low-frequency components can be used for attention calculation. This mitigates the impact of noise on attention computation and enables the generation of high-quality inpainted videos.
The method is evaluated on two benchmark datasets, YouTube-VOS and DAVIS, and is reported to outperform existing methods in PSNR, SSIM, and flow warping error ($E_{warp}$), with relative improvements of 7.45% in PSNR and 9.48% in $E_{warp}$. It is also robust to noise, producing visually plausible and spatio-temporally coherent results. The contributions are threefold: a theoretical demonstration of the negative impact of noise on attention calculation, a novel wavelet transformer network, and experiments on benchmark datasets showing its effectiveness. The paper also covers the methodology, including the wavelet spatial-temporal transformer and the loss function, along with the experimental results.
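
The idea of computing attention on the low-frequency subband can be illustrated with a small sketch. This is not the paper's implementation. It assumes a single-level orthonormal Haar wavelet, a synthetic smooth frame with additive Gaussian noise, and attention with queries equal to keys and no learned projections. All function names (`haar_dwt2`, `patch_tokens`, `attention_weights`, `snr`, `make_frames`) are hypothetical. Under these assumptions, white noise spreads evenly over the four subbands while a smooth signal concentrates in the low-frequency one, so the low-frequency subband should have a higher signal-to-noise ratio than the raw pixels. The noise is reduced relative to the signal, not eliminated.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT (height and width must be even).

    Returns the low-frequency subband LL and three high-frequency subbands.
    Subband naming (LH vs. HL) varies between libraries.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a - b + c - d) / 2.0  # difference between columns
    hl = (a + b - c - d) / 2.0  # difference between rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def patch_tokens(band, patch=8):
    """Split a 2D subband into non-overlapping patch x patch tokens (one row each)."""
    h, w = band.shape
    grid = band.reshape(h // patch, patch, w // patch, patch)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def attention_weights(tokens):
    """Row-wise softmax of scaled dot products.

    Assumption: queries = keys = tokens, with no learned projections.
    """
    logits = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def snr(clean, noisy):
    """Ratio of signal energy to noise energy."""
    return float(np.sum(clean ** 2) / np.sum((noisy - clean) ** 2))


def make_frames(size=64, sigma=0.3, seed=0):
    """Hypothetical test data: a smooth frame plus additive Gaussian noise."""
    yy, xx = np.mgrid[0:size, 0:size]
    clean = np.sin(2 * np.pi * xx / size) + np.cos(2 * np.pi * yy / size)
    noisy = clean + np.random.default_rng(seed).normal(0.0, sigma, clean.shape)
    return clean, noisy


clean, noisy = make_frames()
ll_clean = haar_dwt2(clean)[0]
ll_noisy = haar_dwt2(noisy)[0]

print("SNR in pixel domain:", snr(clean, noisy))
print("SNR in LL subband:  ", snr(ll_clean, ll_noisy))

# Attention computed on low-frequency tokens only.
attn = attention_weights(patch_tokens(ll_noisy))
print("attention shape:", attn.shape)
```

The sketch uses a fixed Haar filter for simplicity. A library such as PyWavelets could supply other wavelets, and the network described in the paper operates on learned features rather than raw pixel patches.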

WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting

2024 | Zhiliang Wu, Changchang Sun, Hanyu Xuan, Gaowen Liu, Yan Yan