19 Mar 2024 | Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison
EscherNet is a multi-view conditioned diffusion model designed for scalable view synthesis. It learns implicit and generative 3D representations using a specialized camera positional encoding, enabling precise and continuous control of camera transformations between reference and target views. EscherNet can generate more than 100 consistent target views simultaneously on a consumer-grade GPU, despite being trained with only 3 reference and 3 target views. The model offers exceptional generality, flexibility, and scalability: it addresses zero-shot novel view synthesis and unifies single- and multi-image 3D reconstruction in a single framework. Extensive experiments demonstrate state-of-the-art performance on multiple benchmarks, even against methods tailored for each specific task.

EscherNet's design shifts from scene-specific encodings to a 3D representation grounded solely in scene colors and geometries, making it easier to scale with everyday posed 2D image data. Its key innovations are the Camera Positional Encoding (CaPE), which encodes camera poses for both 4-DoF and 6-DoF camera transformations, and a transformer architecture whose attention captures the relationships underlying reference-to-target and target-to-target view consistency; a sketch of the CaPE idea follows below. EscherNet is evaluated on novel view synthesis and 3D reconstruction benchmarks, showing superior results compared to existing methods.
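To make the CaPE idea concrete, here is a minimal, illustrative sketch of a 6-DoF camera positional encoding in the spirit of the paper: token features are split into 4-vectors, key blocks are multiplied by the view's 4x4 pose matrix and query blocks by its inverse-transpose, so the attention dot product depends only on the relative transformation between the two views. The helper names (cape_6dof, rand_pose) and the exact block layout are assumptions for illustration, not EscherNet's actual implementation.

```python
import torch

def cape_6dof(x: torch.Tensor, P: torch.Tensor, is_query: bool) -> torch.Tensor:
    """Apply an illustrative 6-DoF camera positional encoding.

    x: (..., d) token features, with d divisible by 4
    P: (4, 4) camera pose for this view
    Queries are transformed by P^{-T} and keys by P, so that
    <cape(q, P1, True), cape(k, P2, False)> = q_hat P1^{-T} P2^T k_hat^T,
    which depends only on the relative pose P2 @ P1^{-1}.
    """
    *batch, d = x.shape
    M = torch.linalg.inv(P).transpose(-1, -2) if is_query else P
    blocks = x.reshape(*batch, d // 4, 4)  # split feature into 4-vectors
    return (blocks @ M).reshape(*batch, d)  # transform each block by the pose

def rand_pose() -> torch.Tensor:
    """Random SE(3) matrix: rotation via QR, plus a random translation."""
    Q, _ = torch.linalg.qr(torch.randn(3, 3))
    if torch.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]  # enforce a proper rotation (det = +1)
    P = torch.eye(4)
    P[:3, :3] = Q
    P[:3, 3] = torch.randn(3)
    return P

# Sanity check: the attention score is invariant to a global pose change G,
# i.e. it depends only on the relative transformation between the two views.
q, k = torch.randn(8), torch.randn(8)
P1, P2, G = rand_pose(), rand_pose(), rand_pose()
s1 = cape_6dof(q, P1, True) @ cape_6dof(k, P2, False)
s2 = cape_6dof(q, P1 @ G, True) @ cape_6dof(k, P2 @ G, False)
assert torch.allclose(s1, s2, atol=1e-4)
```

Because the score is a function of the relative pose alone, the model needs no global coordinate frame, which is what lets a network trained on 3 reference and 3 target views generalize to arbitrary numbers of views at inference time.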