SPAD: Spatially Aware Multi-View Diffusers

7 Feb 2024 | Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin
SPAD (Spatially Aware Multi-View Diffusers) is a novel framework for generating consistent multi-view images from text prompts or single images. The approach repurposes a pre-trained 2D diffusion model by extending its self-attention layers with cross-view interactions and fine-tuning it on high-quality Objaverse data. To address content copying between views, SPAD explicitly constrains cross-view attention based on epipolar geometry. Additionally, Plücker coordinates derived from camera rays are used as positional encodings to enhance 3D consistency. SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. It also prevents the multi-face Janus issue in text-to-3D generation. The method operates in two modes, text-conditioned and image-conditioned, and scores well on FID, Inception Score, PSNR, SSIM, and LPIPS, producing high-quality, 3D-consistent multi-view images and novel views.
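The two geometric ingredients named above, Plücker ray coordinates and an epipolar constraint on cross-view attention, can be illustrated with a small NumPy sketch. This is not SPAD's implementation. The function names are hypothetical, and the sketch assumes a pinhole camera with intrinsics `K` shared by both views, camera-to-world poses given as 4x4 matrices, the (direction, moment) ordering for Plücker coordinates, and a pixel-distance threshold `max_dist` for deciding which positions lie "on" an epipolar line.

```python
import numpy as np


def pixel_grid(height, width):
    """Homogeneous pixel-center coordinates, shape (height * width, 3)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    return np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)


def plucker_rays(K, cam_to_world, height, width):
    """Per-pixel Plucker coordinates (direction, moment), shape (H * W, 6)."""
    pixels = pixel_grid(height, width)
    # Back-project pixels through the intrinsics, then rotate into world space.
    dirs = pixels @ np.linalg.inv(K).T @ cam_to_world[:3, :3].T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = cam_to_world[:3, 3]
    # The moment o x d is identical for every point o along the same ray.
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)


def fundamental_matrix(K, cam_to_world_a, cam_to_world_b):
    """F such that x_b^T F x_a = 0 for corresponding homogeneous pixels."""
    a_to_b = np.linalg.inv(cam_to_world_b) @ cam_to_world_a
    R, t = a_to_b[:3, :3], a_to_b[:3, 3]
    t_cross = np.array([[0.0, -t[2], t[1]],
                        [t[2], 0.0, -t[0]],
                        [-t[1], t[0], 0.0]])
    K_inv = np.linalg.inv(K)
    return K_inv.T @ t_cross @ R @ K_inv


def epipolar_mask(K, cam_to_world_a, cam_to_world_b, height, width, max_dist=1.0):
    """Boolean (H * W, H * W) mask; entry [i, j] is True when pixel j of view b
    lies within max_dist pixels of the epipolar line of pixel i of view a.
    Assumes the two cameras have distinct centers (otherwise F is zero)."""
    F = fundamental_matrix(K, cam_to_world_a, cam_to_world_b)
    pixels = pixel_grid(height, width)
    lines = pixels @ F.T  # row i: epipolar line (a, b, c) in view b
    norms = np.linalg.norm(lines[:, :2], axis=-1, keepdims=True)
    dist = np.abs(lines @ pixels.T) / norms
    return dist <= max_dist


# Example: two views separated by a sideways translation.
K = np.array([[50.0, 0.0, 2.5], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[0, 3] = 0.5
rays_b = plucker_rays(K, pose_b, height=4, width=5)
mask_ab = epipolar_mask(K, pose_a, pose_b, height=4, width=5, max_dist=0.25)
```

In an attention layer, a mask like `mask_ab` would typically be applied by setting the attention logits of disallowed pairs to a large negative value before the softmax, and the six-dimensional ray features would be added to or concatenated with the token features. How SPAD combines them in detail is described in the paper, not in this summary.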