18 Jul 2024 | Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, Jianfei Cai
**Abstract:**
MVSplat is an efficient model that predicts clean feed-forward 3D Gaussians from sparse multi-view images. To accurately localize the Gaussian centers, the model builds a cost volume representation via plane sweeping, which stores cross-view feature similarities and provides valuable geometry cues for depth estimation. The model learns the other Gaussian parameters (opacity, covariance, color) jointly with the Gaussian centers, relying only on photometric supervision. On large-scale benchmarks like RealEstate10K and ACID, MVSplat achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). Notably, compared to the latest method pixelSplat, MVSplat uses 10× fewer parameters and infers more than 2× faster while providing higher appearance and geometry quality and better cross-dataset generalization.
**Keywords:**
Feature Matching · Cost Volume · Gaussian Splatting
**Introduction:**
The paper addresses the problem of 3D scene reconstruction and novel view synthesis from very sparse images using a single forward pass of a trained model. While recent methods like Scene Representation Networks (SRN), Neural Radiance Fields (NeRF), and Light Field Networks (LFN) have made significant progress, they remain impractical due to expensive per-scene optimization, high memory cost, and slow rendering speed. MVSplat introduces a feed-forward Gaussian Splatting model that efficiently predicts 3D Gaussians from sparse multi-view images. Unlike pixelSplat, which regresses probabilistic depth distributions, MVSplat uses a cost volume to learn feature matching information, enabling a more geometry-aware and lightweight model.
**Method:**
MVSplat comprises multi-view feature extraction, cost volume construction, cost volume refinement, depth estimation, and depth refinement. The key component is the cost volume, which models cross-view feature matching information using plane-sweep stereo. The model predicts Gaussian centers, opacity, covariance, and color using a 2D U-Net with cross-view attention. Training uses a simple rendering loss between rendered and ground-truth images.
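To make the cost-volume idea concrete, here is a toy sketch of the two central steps: building a correlation volume over depth candidates, and converting it to per-pixel depth via a softmax expectation (soft-argmax). This is an illustrative simplification, not the paper's implementation: it assumes a rectified two-view setting so that each fronto-parallel depth plane reduces to a horizontal pixel shift, whereas MVSplat warps features with full plane-sweep homographies and refines the volume with a U-Net. The function names and the rectified-stereo assumption are mine.

```python
import numpy as np

def plane_sweep_cost_volume(ref_feat, src_feat, disparities):
    """Toy correlation cost volume (rectified-stereo simplification).

    ref_feat, src_feat: (C, H, W) feature maps from two views.
    disparities: one non-negative pixel shift per depth candidate;
    in the rectified case each depth plane maps to such a shift.
    Returns a (D, H, W) volume of cross-view feature similarities.
    """
    C, H, W = ref_feat.shape
    volume = np.zeros((len(disparities), H, W), dtype=np.float32)
    for i, d in enumerate(disparities):
        # "Warp" source features toward the reference view: shift by d pixels.
        warped = np.zeros_like(src_feat)
        if d == 0:
            warped[:] = src_feat
        else:
            warped[:, :, d:] = src_feat[:, :, :-d]
        # Cost = normalized dot product along channels: high where the
        # two views' features agree at this depth hypothesis.
        volume[i] = (ref_feat * warped).sum(axis=0) / np.sqrt(C)
    return volume

def depth_from_volume(volume, depth_candidates):
    """Soft-argmax: softmax over depth planes, then per-pixel expectation."""
    p = np.exp(volume - volume.max(axis=0, keepdims=True))
    p /= p.sum(axis=0, keepdims=True)
    return (p * depth_candidates[:, None, None]).sum(axis=0)

# Minimal usage: identical views should produce a valid depth map whose
# values lie inside the candidate range.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 6)).astype(np.float32)
vol = plane_sweep_cost_volume(feat, feat, [0, 1, 2])
depth = depth_from_volume(vol, np.array([1.0, 2.0, 4.0]))
print(vol.shape, depth.shape)
```

The soft-argmax makes depth differentiable with respect to the matching costs, which is what lets MVSplat supervise the Gaussian centers purely through the photometric rendering loss.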
**Experiments:**
MVSplat is evaluated on the RealEstate10K and ACID datasets, showing superior performance in image quality, inference speed, and cross-dataset generalization compared to state-of-the-art methods. Ablation studies further validate the importance of the cost volume and cross-view attention to the model's effectiveness.
**Conclusion:**
MVSplat sets a new standard for efficient and high-quality 3D scene reconstruction and novel view synthesis from sparse multi-view images, offering faster inference and better geometry reconstruction compared to existing methods.