latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction


30 Jul 2024 | Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen
**Authors:** Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, and Jan Eric Lenssen

**Institution:** Max Planck Institute for Informatics, Saarland University

**Abstract:** latentSplat is a method for scalable, generalizable 3D reconstruction from two reference views. It autoencodes the views into a 3D latent representation of variational feature Gaussians, enabling fast novel view synthesis. The method combines regression-based and generative approaches and is trained purely on real video data. Its core is the variational 3D Gaussian representation, which models uncertainty explicitly and allows efficient sampling and rendering of specific scene instances via splatting and a lightweight generative decoder. latentSplat outperforms previous methods in reconstruction quality and generalization while remaining fast and scalable to high-resolution data.

**Keywords:** 3D Reconstruction, Novel View Synthesis, Feature Gaussian Splatting, Efficient 3D Representation Learning

**Introduction:** 3D reconstruction from a few images has advanced along two lines. Regression-based methods such as pixelNeRF and pixelSplat are efficient but struggle in uncertain regions, while generative models such as Zero-1-to-3 and GeNVS handle high uncertainty well but are slow and do not scale to large scenes. latentSplat combines the strengths of both by using variational 3D Gaussians, which model uncertainty explicitly and can be rendered efficiently.

**Methods:** Two reference views are encoded into a 3D latent representation of variational Gaussians, from which specific instances are sampled and rendered into novel views. The encoder consists of a vision transformer, an epipolar transformer, and a Gaussian sampling head; the decoder is a lightweight VAE-GAN network that decodes rendered feature images into novel views. Training uses a combination of reconstruction, auxiliary, and generative losses. Minimal sketches of the feature sampling step, the render path, and the training objective follow below.
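The central object is the variational feature Gaussian: each 3D Gaussian carries a distribution over semantic features rather than a fixed feature vector, so sampling yields one concrete scene instance. Below is a minimal sketch (not the authors' code) of drawing per-Gaussian features with the reparameterization trick; the tensor shapes and the `log_var` parameterization are assumptions.

```python
# Minimal sketch, not the authors' code: sample per-Gaussian semantic
# features from a predicted diagonal Gaussian distribution using the
# reparameterization trick. Shapes and parameterization are assumptions.
import torch

def sample_variational_features(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# Example: 4096 Gaussians, each with a 32-dimensional feature distribution.
mu = torch.randn(4096, 32)        # predicted feature means
log_var = torch.randn(4096, 32)   # predicted feature log-variances
features = sample_variational_features(mu, log_var)  # one sampled instance
```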
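The render path splats the sampled feature Gaussians into a 2D feature image and decodes it into RGB. The skeleton below is a sketch under stated assumptions: `splat_features` is a trivial placeholder for a differentiable feature-Gaussian rasterizer, and `LightweightDecoder` is a hypothetical stand-in for the paper's VAE-GAN decoder, not its actual architecture.

```python
# Sketch under stated assumptions: `splat_features` is a placeholder for a
# differentiable feature-Gaussian rasterizer, and `LightweightDecoder` is a
# hypothetical stand-in for the paper's VAE-GAN decoder.
import torch
import torch.nn as nn

class LightweightDecoder(nn.Module):
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat_img: torch.Tensor) -> torch.Tensor:
        return self.net(feat_img)  # (B, 3, H, W) RGB novel view

def splat_features(features: torch.Tensor, image_hw=(64, 64)) -> torch.Tensor:
    """Placeholder: a real rasterizer projects each 3D Gaussian into screen
    space and alpha-composites its features; here we only broadcast the
    mean feature to illustrate the interface."""
    h, w = image_hw
    return features.mean(dim=0).view(1, -1, 1, 1).expand(1, -1, h, w)

decoder = LightweightDecoder()
feat_img = splat_features(torch.randn(4096, 32))  # sampled Gaussian features
rgb = decoder(feat_img)                           # decoded novel view
```

Splatting to a low-dimensional feature image before decoding is what keeps rendering fast: the heavy generative work happens once, in 2D, instead of per sample along rays.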
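The training objective combines the three kinds of terms named above. The sketch below is a hedged version: it assumes an MSE reconstruction term, a KL divergence as the auxiliary term, and a non-saturating GAN generator term, with made-up weights; the paper's exact losses and weighting differ.

```python
# Hedged sketch of a combined objective: reconstruction + auxiliary +
# generative terms. The specific terms (MSE, KL, non-saturating GAN) and
# the weights below are assumptions, not the paper's exact choices.
import torch
import torch.nn.functional as F

def total_loss(pred: torch.Tensor, target: torch.Tensor,
               mu: torch.Tensor, log_var: torch.Tensor,
               disc_logits_fake: torch.Tensor,
               w_rec: float = 1.0, w_aux: float = 0.1, w_gen: float = 0.05):
    l_rec = F.mse_loss(pred, target)                          # reconstruction
    # KL(N(mu, sigma^2) || N(0, I)) per Gaussian, one possible auxiliary term.
    l_aux = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).mean()
    l_gen = F.softplus(-disc_logits_fake).mean()              # generator GAN term
    return w_rec * l_rec + w_aux * l_aux + w_gen * l_gen

# Dummy usage with random tensors standing in for network outputs.
pred, target = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
mu, log_var = torch.randn(4096, 32), torch.randn(4096, 32)
loss = total_loss(pred, target, mu, log_var, disc_logits_fake=torch.randn(1))
```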
**Experiments:** latentSplat outperforms previous methods on both object-centric and general scene reconstruction, achieving state-of-the-art quality on generative and perceptual metrics. It also generalizes well to view extrapolation, producing high-quality reconstructions even for unobserved areas, while maintaining real-time rendering and memory efficiency.

**Conclusion:** latentSplat successfully combines regression-based and generative approaches, achieving state-of-the-art novel view synthesis from two input images. It provides high-quality, 3D-consistent reconstructions and is much faster and more scalable than previous generative approaches.