SPAD: Spatially Aware Multi-View Diffusers

7 Feb 2024 | Yash Kant¹,²,⁴, Ziyi Wu¹,⁴, Michael Vasilkovsky², Guocheng Qian²,³, Jian Ren², Riza Alp Guler², Bernard Ghanem³, Sergey Tulyakov², Igor Gilitschenski¹,⁴,*, Aliaksandr Siarohin²,*
SPAD is a novel approach for generating consistent multi-view images from text prompts or single images. It extends a pretrained 2D diffusion model with cross-view interactions in its self-attention layers and fine-tunes the result on a subset of Objaverse. To prevent content copying between views, SPAD constrains this cross-view attention using epipolar geometry, and it incorporates Plücker coordinates as positional encodings to further improve 3D consistency and prevent flipped-view predictions.

SPAD offers full camera control and achieves state-of-the-art novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets, generating consistent multi-view images of diverse 3D objects ranging from everyday items to complex machines. It also enables high-quality text-to-3D generation via a feed-forward multi-view-to-3D triplane generator and multi-view Score Distillation Sampling, producing 3D models free of the multi-face Janus issue. Across these evaluations, SPAD outperforms existing methods in image generation quality, 3D consistency, and camera control.
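To make the epipolar constraint concrete, the sketch below builds a binary cross-view attention mask from two pinhole cameras: each pixel in one view is allowed to attend only to pixels lying near its epipolar line in the other view. This is a minimal NumPy illustration of the standard multi-view geometry involved, not SPAD's exact implementation; the function name, the world-to-camera extrinsics convention, and the pixel threshold are assumptions made here for illustration.

```python
import numpy as np

def epipolar_attention_mask(K1, R1, t1, K2, R2, t2, height, width, thresh=1.5):
    """Binary mask restricting cross-view attention to epipolar lines.

    Hypothetical helper: for every pixel in view 1, allow attention only to
    view-2 pixels whose distance to the corresponding epipolar line is below
    `thresh` (in pixels). Cameras are assumed to use world-to-camera
    extrinsics, x_cam = R @ x_world + t. Returns an (H*W, H*W) boolean mask.
    """
    # Relative pose taking view-1 camera coordinates to view-2.
    R = R2 @ R1.T
    t = t2 - R @ t1
    # Skew-symmetric matrix [t]_x so that [t]_x @ v == cross(t, v).
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])
    # Fundamental matrix F = K2^{-T} [t]_x R K1^{-1}.
    F = np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)

    # All pixel centers in homogeneous coordinates, flattened to (N, 3).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)], axis=-1)

    # Epipolar line in view 2 for each view-1 pixel: l = F @ x, rows of (N, 3).
    lines = pix @ F.T
    # Point-to-line distance |l . x'| / sqrt(a^2 + b^2) for every pixel pair.
    num = np.abs(lines @ pix.T)                                  # (N, N)
    den = np.linalg.norm(lines[:, :2], axis=-1, keepdims=True)   # (N, 1)
    return (num / np.clip(den, 1e-8, None)) < thresh
```

Since the mask is quadratic in the number of pixels, a mask like this would realistically be computed at the diffusion UNet's low-resolution feature maps rather than at full image resolution.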
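The Plücker positional encoding mentioned above assigns each pixel the six-dimensional coordinates (d, o × d) of its camera ray, which identify the ray independently of any particular point on it, so every view gets a camera-aware, 3D-consistent positional signal. Below is a minimal sketch of computing such per-pixel ray embeddings, again assuming a pinhole camera with world-to-camera extrinsics; the function name is hypothetical.

```python
import numpy as np

def plucker_ray_embedding(K, R, t, height, width):
    """Per-pixel Plucker coordinates (d, o x d) for a pinhole camera.

    Hypothetical helper: K is the (3, 3) intrinsics matrix; R, t are the
    world-to-camera rotation and translation (x_cam = R @ x_world + t).
    Returns an (H, W, 6) array holding the unit ray direction and its moment.
    """
    # Camera center in world coordinates: o = -R^T t.
    o = -R.T @ t
    # Pixel centers in homogeneous coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel to a world-space ray direction: d = R^T K^{-1} x.
    d = pix @ np.linalg.inv(K).T @ R
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Moment m = o x d; the pair (d, m) is the same for every point on the ray.
    m = np.cross(o, d)
    return np.concatenate([d, m], axis=-1)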