GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation


21 Jun 2024 | Chubin Zhang, Hongliang Song, Yi Wei, Yu Chen, Jiwen Lu, Yansong Tang
GeoLRM is a geometry-aware large reconstruction model that generates high-quality 3D Gaussian representations from up to 21 input images using only 11 GB of GPU memory. Unlike previous methods, which neglect the sparsity of 3D structures and lack an explicit geometric relationship between 3D points and 2D images, GeoLRM uses a 3D-aware transformer that operates directly on 3D points and integrates image features into the 3D representation through deformable cross-attention.

The method is a two-stage pipeline: a lightweight proposal network first predicts sparse 3D anchor points from the input images, and a specialized reconstruction transformer then refines the geometry and retrieves textural details. This design avoids conventional representations such as triplanes or pixel-aligned Gaussians in favor of direct interaction within 3D space.
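To make the two-stage flow concrete, here is a minimal PyTorch sketch. Everything in it, including the module names, the voxel-grid proposal, and the 14-channel Gaussian parameterization, is an illustrative assumption about how such a pipeline can be wired up, not the authors' released code.

```python
# Hypothetical sketch of a two-stage proposal-then-refinement pipeline.
import torch
import torch.nn as nn


class ProposalNetwork(nn.Module):
    """Stage 1 (sketch): score a dense voxel grid for occupancy and keep
    occupied cells as sparse 3D anchor points."""

    def __init__(self, feat_dim: int = 64, grid_res: int = 16):
        super().__init__()
        self.grid_res = grid_res
        # Stand-in for the lightweight image-conditioned occupancy head.
        self.occupancy_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (R**3, C) features gathered onto the dense grid.
        logits = self.occupancy_head(voxel_feats).squeeze(-1)  # (R**3,)
        keep = torch.sigmoid(logits) > 0.5                     # occupied cells
        r = torch.linspace(-1, 1, self.grid_res, device=voxel_feats.device)
        coords = torch.stack(
            torch.meshgrid(r, r, r, indexing="ij"), dim=-1
        ).reshape(-1, 3)
        return coords[keep]                                    # (N, 3) anchors


class ReconstructionTransformer(nn.Module):
    """Stage 2 (sketch): refine anchor tokens and decode Gaussian parameters.
    The real model also applies deformable cross-attention to image features
    between the self-attention layers."""

    def __init__(self, dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # 14 channels per Gaussian: 3 position offset + 3 scale +
        # 4 rotation (quaternion) + 1 opacity + 3 color (one plausible choice).
        self.gaussian_head = nn.Linear(dim, 14)

    def forward(self, anchors: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(anchors).unsqueeze(0)     # (1, N, dim)
        tokens = self.backbone(tokens)
        return self.gaussian_head(tokens).squeeze(0)  # (N, 14)


# Toy end-to-end pass with random stand-in voxel features.
anchors = ProposalNetwork()(torch.randn(16 ** 3, 64))
gaussians = ReconstructionTransformer()(anchors)
print(anchors.shape, gaussians.shape)
```

The design point this illustrates is sparsity: stage 1 discards empty space, so stage 2's transformer spends attention only on tokens that lie on or near the object rather than on a dense volume.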
Architecturally, a hierarchical image encoder extracts both low- and high-level image features, and an anchor point decoder transforms these features into 3D representations. Deformable attention then lets each 3D token gather image features at its projections into the input views, explicitly modeling the 3D-to-2D relationship (sketched in the code at the end of this summary).

Trained on the Objaverse dataset and evaluated on the Google Scanned Objects dataset, GeoLRM achieves state-of-the-art results on four of the five metrics studied and significantly outperforms existing models, especially for dense view inputs: because the design is geometry-aware, it can exploit up to 21 views, and performance improves consistently as views are added, yielding better 3D models than those reconstructed from fewer images. It also shows superior geometric accuracy, which the authors attribute to the explicit modeling of the 3D-to-2D relationship, and qualitative comparisons with several LRM-based baselines demonstrate its ability to handle complex reconstructions. Combined with SV3D, GeoLRM is also practical for 3D generation tasks, underscoring its versatility and potential for broader adoption. The architecture and training procedures are detailed in the paper.

The main limitations are that the pipeline is not end-to-end, so errors from the proposal stage accumulate, and that the proposal network must process Gaussian points over the whole 3D space, which makes the model unsuitable for real-time applications. Future work aims to extend the model in an end-to-end manner.
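The projection-and-sampling step referenced above can be sketched as follows, assuming pinhole cameras given as intrinsics and world-to-camera extrinsics. The function name and the plain mean over views are hypothetical simplifications; the paper's deformable cross-attention additionally predicts per-point sampling offsets and learned aggregation weights around each projection.

```python
# Hedged sketch of gathering 2D features for 3D anchor points.
import torch
import torch.nn.functional as F


def sample_image_features(
    points: torch.Tensor,      # (N, 3) anchor points in world space
    feats: torch.Tensor,       # (V, C, H, W) per-view feature maps
    intrinsics: torch.Tensor,  # (V, 3, 3) pinhole intrinsics
    extrinsics: torch.Tensor,  # (V, 4, 4) world-to-camera transforms
) -> torch.Tensor:
    """Project every anchor into every view and bilinearly sample features."""
    V, C, H, W = feats.shape
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (N, 4)
    cam = torch.einsum("vij,nj->vni", extrinsics, homo)[..., :3]  # (V, N, 3)
    pix = torch.einsum("vij,vnj->vni", intrinsics, cam)           # (V, N, 3)
    uv = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)              # (V, N, 2)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack(
        [2 * uv[..., 0] / (W - 1) - 1, 2 * uv[..., 1] / (H - 1) - 1], dim=-1
    ).unsqueeze(2)                                                # (V, N, 1, 2)
    sampled = F.grid_sample(feats, grid, align_corners=True)      # (V, C, N, 1)
    # Plain mean over views; deformable attention would instead fuse the
    # per-view samples with learned weights.
    return sampled.squeeze(-1).mean(dim=0).transpose(0, 1)        # (N, C)


# Toy call: two identity-pose views, points pushed 2 units down the z-axis.
points = torch.rand(128, 3) - 0.5
points[:, 2] += 2.0
K = torch.tensor([[60.0, 0.0, 32.0], [0.0, 60.0, 32.0], [0.0, 0.0, 1.0]])
out = sample_image_features(
    points,
    torch.randn(2, 32, 64, 64),
    K.expand(2, 3, 3),
    torch.eye(4).expand(2, 4, 4),
)
print(out.shape)  # torch.Size([128, 32])
```

Because each 3D token fetches features only at its own projections, the cost scales with the number of anchor points and views rather than with a dense 3D grid, which is what makes dense-view inputs tractable.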