30 Apr 2024 | Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, Zexiang Xu
GS-LRM is a scalable large reconstruction model that predicts high-quality 3D Gaussian primitives from 2-4 posed sparse-view images in ~0.23 seconds on a single A100 GPU. The architecture is a simple transformer: input images are patchified, the concatenated multi-view image tokens pass through a sequence of transformer blocks, and final per-pixel Gaussian parameters are decoded directly from these tokens for differentiable rendering.

Unlike previous LRMs that can only reconstruct objects, GS-LRM's per-pixel Gaussian prediction lets it naturally handle scenes with large variations in scale and complexity, and to better preserve high-frequency details. The model is trained on Objaverse for object captures and on RealEstate10K for scene captures, and it outperforms state-of-the-art baselines in both settings, with significant improvements in quantitative metrics such as PSNR and SSIM. GS-LRM is scalable in terms of model size, training data, and scene scale.

Beyond reconstruction, GS-LRM is applied to downstream text-to-3D and image-to-3D generation tasks, demonstrating its versatility at both the object and scene level.
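To make the token flow concrete, here is a minimal PyTorch sketch of the pipeline described above: patchify the posed views into tokens, run the concatenated multi-view tokens through transformer blocks, and decode each token back into per-pixel Gaussian parameters. The class name `GSLRMSketch` and all hyperparameters are illustrative placeholders, not the authors' released code; the 12-channel per-pixel parameterization (RGB, scale, rotation quaternion, opacity, ray distance) follows the paper's description.

```python
import torch
import torch.nn as nn

class GSLRMSketch(nn.Module):
    """Minimal sketch of the GS-LRM token flow (illustrative, not official)."""

    def __init__(self, patch_size=8, dim=512, depth=12, num_heads=8):
        super().__init__()
        self.patch_size = patch_size
        # Linear patch embedding: each p x p patch of the 3-channel image
        # becomes one token. (The real model additionally concatenates
        # Plucker-ray pose conditioning per pixel; omitted here for brevity.)
        self.patchify = nn.Linear(3 * patch_size * patch_size, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Decode each token to per-pixel Gaussian parameters:
        # 3 (RGB) + 3 (scale) + 4 (rotation quaternion) + 1 (opacity)
        # + 1 (distance along the pixel ray) = 12 values per pixel.
        self.decode = nn.Linear(dim, 12 * patch_size * patch_size)

    def forward(self, images):
        # images: (B, V, 3, H, W) posed sparse input views
        B, V, C, H, W = images.shape
        p = self.patch_size
        # Patchify each view and flatten patches into tokens.
        x = images.reshape(B, V, C, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4, 6)
        x = x.reshape(B, V * (H // p) * (W // p), C * p * p)
        tokens = self.patchify(x)
        # All views' tokens are concatenated along the sequence axis, so
        # self-attention mixes information across views.
        tokens = self.blocks(tokens)
        # One Gaussian per input pixel, decoded from the output tokens.
        return self.decode(tokens).reshape(B, V * H * W, 12)

model = GSLRMSketch()
views = torch.randn(1, 4, 3, 64, 64)   # 4 posed input views
print(model(views).shape)              # torch.Size([1, 16384, 12])
```

Predicting one Gaussian per input pixel is what distinguishes this design from object-only LRMs that decode a fixed-size triplane or point set: the output resolution grows with the input, so the same architecture covers small objects and large scenes.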