30 Apr 2024
**GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting**
**Authors:** Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, Zexiang Xu
**Affiliations:** Adobe Research, Cornell University
**Abstract:**
We propose GS-LRM, a scalable large reconstruction model that predicts high-quality 3D Gaussian primitives from 2-4 sparse posed images in approximately 0.23 seconds on a single A100 GPU. Our model features a simple transformer-based architecture: input posed images are patchified and concatenated, passed through a sequence of transformer blocks, and per-pixel Gaussian parameters are decoded directly from the output tokens for differentiable rendering. Unlike previous LRMs that can only reconstruct objects, GS-LRM naturally handles scenes with large variations in scale and complexity by predicting per-pixel Gaussians. We demonstrate the model's effectiveness on both object and scene captures, achieving state-of-the-art performance on large-scale datasets such as Objaverse and RealEstate10K, and show superior results in downstream 3D generation tasks.
**Keywords:**
Large Reconstruction Models, 3D Reconstruction, Gaussian Splatting
**Introduction:**
Reconstructing 3D scenes from image captures is a central challenge in computer vision. Traditional methods rely on complex photogrammetry systems; recent advances in neural representations and differentiable rendering achieve superior reconstruction and rendering quality, but they are slow to optimize and still require a large number of input views. Transformer-based 3D large reconstruction models (LRMs) have emerged that learn general 3D reconstruction priors from vast collections of 3D objects, yet they often suffer from limited triplane resolution and expensive volume rendering. Our goal is to build a general, scalable, and efficient 3D reconstruction model. GS-LRM predicts 3D Gaussian primitives from sparse input images, enabling fast, high-quality rendering and reconstruction for both objects and scenes.
**Method:**
Our model uses a transformer-based architecture to regress per-pixel 3D Gaussian parameters from a set of images with known camera poses. Input images are tokenized into patch tokens and passed through a sequence of transformer blocks, and the output tokens are decoded into pixel-aligned 3D Gaussians by a linear layer. The final reconstruction is the union of the per-pixel Gaussians predicted from every input view. Trained on large datasets, the model achieves high-quality sparse-view reconstruction in both object and scene scenarios.
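To make the pipeline concrete, below is a minimal PyTorch sketch of the tokenize-transform-decode flow described above. The patch size, embedding width, pose conditioning via extra per-pixel ray-embedding channels, and the choice of 12 Gaussian parameters per pixel (RGB, scale, rotation quaternion, opacity, and depth along the pixel ray) are illustrative assumptions for this sketch, not values stated in this summary.

```python
import torch
import torch.nn as nn

class GSLRMSketch(nn.Module):
    """Toy GS-LRM-style regressor: posed images -> per-pixel 3D Gaussian parameters."""

    def __init__(self, patch=8, dim=768, depth=12, heads=12, in_ch=9, gauss_ch=12):
        super().__init__()
        self.patch, self.gauss_ch = patch, gauss_ch
        # Patchify: every (patch x patch x in_ch) image patch becomes one token.
        self.tokenize = nn.Linear(patch * patch * in_ch, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        # Self-attention runs jointly over the tokens of all input views.
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Linear decode: each output token -> Gaussian parameters for its patch's pixels.
        self.decode = nn.Linear(dim, patch * patch * gauss_ch)

    def forward(self, images):
        # images: (B, V, C, H, W); C = 3 RGB + assumed per-pixel ray-embedding channels.
        B, V, C, H, W = images.shape
        p = self.patch
        x = images.reshape(B, V, C, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 4, 6, 2)                    # (B, V, H/p, W/p, p, p, C)
        x = x.reshape(B, V * (H // p) * (W // p), p * p * C)  # multi-view patch tokens
        tokens = self.blocks(self.tokenize(x))
        gauss = self.decode(tokens)                           # per-pixel Gaussian params
        gauss = gauss.reshape(B, V, H // p, W // p, p, p, self.gauss_ch)
        gauss = gauss.permute(0, 1, 6, 2, 4, 3, 5).reshape(B, V, self.gauss_ch, H, W)
        return gauss  # pixel-aligned Gaussians; merging = concatenating over all views


# Illustrative shapes only: 4 posed views at 256x256 with 9 input channels,
# using a shrunken transformer so the forward pass runs quickly on CPU.
model = GSLRMSketch(dim=256, depth=2, heads=8)
out = model(torch.randn(1, 4, 9, 256, 256))
print(out.shape)  # torch.Size([1, 4, 12, 256, 256])
```

Predicting one Gaussian per pixel of every input view is what lets the same architecture scale from single objects to full scenes: the number of primitives grows with input resolution and view count instead of being tied to a fixed triplane grid.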
**Experiments:**
We evaluate GS-LRM on object-level and scene-level datasets, showing significant improvements over state-of-the-art baselines in terms of PSNR, SSIM, and LPIPS metrics. Visual comparisons demonstrate that GS-LRM outperforms competing methods in terms of sharpness, texture, and geometry reconstruction. We also showcase high-resolution recon**GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting**