GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

27 May 2024 | Junyoung Seo1,3*, Kazumi Fukuda1, Takashi Shibuya1, Takuya Narihira1, Naoki Murata1, Shoukang Hu1, Chieh-Hsin Lai1, Seungryong Kim3†, Yuki Mitsufuji1,2†
**GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping** introduces an approach for generating high-quality novel views from a single input image while preserving semantic details. The method addresses the limitations of existing techniques, which often struggle with noisy depth maps and with the loss of semantic information during geometric warping. GenWarp integrates view warping and occlusion inpainting into a unified process, using a two-stream architecture consisting of a semantic preserver network and a diffusion model. By augmenting self-attention with cross-view attention, the model learns to determine where to warp and where to generate, handling both in-domain and out-of-domain images. Extensive experiments on RealEstate10K, ScanNet, and in-the-wild images demonstrate that GenWarp outperforms existing methods both qualitatively and quantitatively. The approach leverages large-scale text-to-image (T2I) models and monocular depth estimation (MDE) to achieve high-quality novel view synthesis, making it a promising solution for applications that require flexible camera viewpoint changes in generated images.
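To make the core idea concrete, here is a minimal PyTorch sketch of self-attention augmented with cross-view attention: queries from the novel-view stream attend jointly to their own tokens and to reference-view tokens from a semantic-preserver stream, so the attention weights softly decide where to "warp" (copy reference content) and where to "generate". The module name, layer layout, and shapes are illustrative assumptions for exposition, not the authors' implementation, which conditions on additional signals such as warped coordinate information.

```python
# Illustrative sketch only: self-attention augmented with cross-view attention.
# Names, shapes, and the module structure are assumptions, not GenWarp's code.
import torch
import torch.nn as nn


class AugmentedCrossViewAttention(nn.Module):
    """Attention over novel-view tokens, augmented with keys/values from a
    reference-view (semantic preserver) stream. Softmax over the joint key
    set lets each query weigh 'copying' reference content against
    'generating' from its own context."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv_self = nn.Linear(dim, dim * 2, bias=False)
        self.to_kv_ref = nn.Linear(dim, dim * 2, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, C) tokens of the novel view being denoised
        # ref: (B, M, C) tokens from the reference-view semantic stream
        B, N, C = x.shape
        H = self.num_heads

        q = self.to_q(x)
        k_s, v_s = self.to_kv_self(x).chunk(2, dim=-1)
        k_r, v_r = self.to_kv_ref(ref).chunk(2, dim=-1)

        # Concatenate self and cross-view keys/values along the token axis.
        k = torch.cat([k_s, k_r], dim=1)
        v = torch.cat([v_s, v_r], dim=1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, -1, H, C // H).transpose(1, 2)  # (B, H, L, C//H)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    layer = AugmentedCrossViewAttention(dim=320, num_heads=8)
    novel = torch.randn(2, 64 * 64, 320)      # novel-view latent tokens
    reference = torch.randn(2, 64 * 64, 320)  # semantic-preserver tokens
    print(layer(novel, reference).shape)      # torch.Size([2, 4096, 320])
```

In a two-stream setup like the one described above, a block of this kind would sit inside the diffusion U-Net and receive `ref` from the corresponding layer of the semantic preserver network at every denoising step.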