Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos


26 Apr 2024 | Zhengze Xu1*, Mengting Chen2, Zhao Wang2, Linyu Xing2, Zhonghua Zhai2, Nong Sang1, Jinsong Lan2, Shuai Xiao2†, Changxin Gao1†
The paper "Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos" addresses the challenging task of video virtual try-on, which aims to dress a person in a video with clothing while preserving both the appearance of the clothing and the person's movements. The main obstacle in this task is the need to preserve fine details of the clothing and model coherent motions simultaneously. To tackle these challenges, the authors propose a diffusion-based framework named "Tunnel Try-on." The core idea is to extract a "focus tunnel" in the input video, which zooms in on the region around the clothing to better preserve its details. The model uses Kalman filtering to smooth the focus tunnel and injects the tunnel's position embedding into attention layers to improve video continuity. Additionally, an environment encoder is developed to extract global context information outside the tunnels, enhancing the background generation. Extensive experiments demonstrate that Tunnel Try-on significantly outperforms other video virtual try-on methods, achieving state-of-the-art performance in complex scenarios. The contributions of the paper include the introduction of Tunnel Try-on, a novel focus tunnel extraction strategy, and enhancing techniques such as tunnel smoothing and tunnel embedding. The method is evaluated on two datasets, showing superior performance in both qualitative and quantitative metrics, including SSIM, LPIPS, and VFID.The paper "Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos" addresses the challenging task of video virtual try-on, which aims to dress a person in a video with clothing while preserving both the appearance of the clothing and the person's movements. The main obstacle in this task is the need to preserve fine details of the clothing and model coherent motions simultaneously. To tackle these challenges, the authors propose a diffusion-based framework named "Tunnel Try-on." The core idea is to extract a "focus tunnel" in the input video, which zooms in on the region around the clothing to better preserve its details. The model uses Kalman filtering to smooth the focus tunnel and injects the tunnel's position embedding into attention layers to improve video continuity. Additionally, an environment encoder is developed to extract global context information outside the tunnels, enhancing the background generation. Extensive experiments demonstrate that Tunnel Try-on significantly outperforms other video virtual try-on methods, achieving state-of-the-art performance in complex scenarios. The contributions of the paper include the introduction of Tunnel Try-on, a novel focus tunnel extraction strategy, and enhancing techniques such as tunnel smoothing and tunnel embedding. The method is evaluated on two datasets, showing superior performance in both qualitative and quantitative metrics, including SSIM, LPIPS, and VFID.