19 Mar 2024 | Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison
EscherNet is a multi-view conditioned diffusion model for view synthesis that generates a flexible number of consistent target views with arbitrary camera poses, based on a flexible number of reference views. The model learns implicit and generative 3D representations coupled with a specialized camera positional encoding, allowing precise and continuous relative control of the camera transformation between reference and target views. EscherNet offers exceptional generality, flexibility, and scalability: it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained on a fixed setup of 3 reference views to 3 target views. It addresses zero-shot novel view synthesis and naturally unifies single- and multi-image 3D reconstruction, achieving state-of-the-art performance on multiple benchmarks even when compared with methods tailored to each individual problem. This design enables efficient and scalable 3D vision applications.

EscherNet leverages a transformer architecture, employing dot-product self-attention to capture both reference-to-target and target-to-target view consistency. A key innovation is its camera positional encoding (CaPE), which represents both 4 DoF (object-centric) and 6 DoF camera poses. CaPE incorporates spatial structure directly into the tokens, so that self-attention between a query and a key depends solely on their relative camera transformation.
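To make that relative-pose property concrete, here is a minimal sketch in the spirit of CaPE's 4-DoF variant, assuming a rotary-style encoding driven by a single azimuth angle. The function name `rotate_tokens` and the frequency schedule are illustrative assumptions, not EscherNet's released implementation; the point is that rotating query and key features by pose-dependent angles makes their dot product a function of the relative angle only.

```python
import numpy as np

def rotate_tokens(x, theta):
    """Rotate consecutive feature pairs of token x by a pose-dependent angle.

    x: (d,) token with even dimension d; theta: camera azimuth in radians.
    Applying this to both queries and keys makes their dot product depend
    only on the *relative* angle between the two camera poses.
    """
    d = x.shape[0]
    assert d % 2 == 0
    # Per-pair frequencies, as in standard rotary position embeddings.
    freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
    angles = theta * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The attention score between a query at pose theta_q and a key at pose
# theta_k depends only on (theta_q - theta_k):
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rotate_tokens(q, 0.3) @ rotate_tokens(k, 0.1)
s2 = rotate_tokens(q, 1.3) @ rotate_tokens(k, 1.1)  # same relative angle
print(np.allclose(s1, s2))  # True: the score ignores absolute poses
```

Because the absolute rotations cancel inside the dot product, the network only ever reasons about relative camera motion, which is what allows arbitrary reference and target poses at inference time.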
EscherNet exhibits consistency, scalability, and generalisation: it can generate any number of target views, with any camera poses, from any number of reference views. Evaluated across both novel view synthesis and single/multi-image 3D reconstruction benchmarks, it outperforms existing 3D diffusion models. Its ability to produce plausible view synthesis from very limited reference views contrasts with scene-specific neural rendering methods, which often struggle under such constraints. EscherNet's design is simple yet scalable, offering a promising avenue for advancing view synthesis and 3D vision.
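For full 6-DoF poses, the same invariance can be obtained with pose-dependent linear maps on reshaped tokens: multiplying keys by the camera pose matrix and queries by its inverse transpose makes the flattened dot product a function of the relative transformation alone. The sketch below demonstrates this identity; the helper names and the (4, m) token reshaping are assumptions for illustration, not the paper's exact code.

```python
import numpy as np

def random_pose(rng):
    """Random SE(3) pose: orthonormal rotation (via QR) plus a translation."""
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(R) < 0:   # ensure a proper rotation, not a reflection
        R[:, 0] *= -1
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = rng.normal(size=3)
    return P

def encode_key(K, P):
    # Key token reshaped to (4, m), transformed by its absolute camera pose.
    return P @ K

def encode_query(Q, P):
    # Inverse transpose on the query side cancels absolute poses in attention.
    return np.linalg.inv(P).T @ Q

def score(Qe, Ke):
    # Flattened dot product: <q', k'> = tr(Q^T P_q^{-1} P_k K).
    return np.sum(Qe * Ke)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
P_q, P_k, T = random_pose(rng), random_pose(rng), random_pose(rng)

s1 = score(encode_query(Q, P_q), encode_key(K, P_k))
# Moving both cameras by a common world transform T leaves the score
# unchanged, because (T P_q)^{-1} (T P_k) = P_q^{-1} P_k.
s2 = score(encode_query(Q, T @ P_q), encode_key(K, T @ P_k))
print(np.allclose(s1, s2))  # True: attention sees only the relative pose
```

Under this scheme the choice of world coordinate frame is irrelevant to the attention scores, which is consistent with EscherNet's claim that view generation is controlled purely by relative camera transformations between reference and target views.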