15 Oct 2024 | An-Chieh Cheng¹, Hongxu Yin², Yang Fu¹, Qiushan Guo², Ruihan Yang¹, Jan Kautz², Xiaolong Wang¹², Sifei Liu²
SpatialRGPT is a novel framework designed to enhance the spatial reasoning capabilities of Vision Language Models (VLMs). It introduces a region representation module and a flexible plugin for depth information, enabling VLMs to effectively perceive spatial arrangements at both local and global scopes. The framework uses a data curation pipeline to learn 3D spatial knowledge from scene graphs and provides a comprehensive benchmark, SpatialRGPT-Bench, for evaluating spatial cognition across diverse environments. The results demonstrate significant improvements in spatial reasoning tasks, showcasing the model's ability to reason about complex spatial relations and to serve as a dense reward annotator for robotic applications.
SpatialRGPT advances VLMs' spatial understanding through two key innovations: (i) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (ii) a flexible "plugin" module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, SpatialRGPT-Bench is a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks.
The framework includes a 3D scene graph construction pipeline that begins with a filtering process to remove unsuitable images. Candidate objects are then identified and grounded using open-vocabulary models, and lifted into 3D space via metric depth estimation and camera calibration. The resulting point clouds are processed to construct the final 3D scene graph: a collection of tuples in which nodes represent specific 3D object instances and edges represent the spatial relationships between them. Each node is defined by the object's class and its width and height in metric scale; each edge encodes one of two types of relations, relative or metric.
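The lifting and graph-construction steps above can be sketched in a few lines. This is a minimal illustration under assumed conventions, not the paper's implementation: the helper `lift_to_3d`, the intrinsics values, the object sizes, and the simple left/right heuristic for relative relations are all hypothetical choices made for the example.

```python
import numpy as np

def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Toy scene-graph nodes: each stores the object's class, metric width/height,
# and a 3D center obtained by lifting a 2D detection with its depth.
nodes = {
    0: {"class": "chair", "width": 0.5, "height": 0.9,
        "center": lift_to_3d(320, 260, 2.0, fx=600, fy=600, cx=320, cy=240)},
    1: {"class": "table", "width": 1.2, "height": 0.75,
        "center": lift_to_3d(480, 250, 2.5, fx=600, fy=600, cx=320, cy=240)},
}

def metric_relation(a, b):
    """Metric relation: Euclidean distance between node centers (meters)."""
    return float(np.linalg.norm(nodes[a]["center"] - nodes[b]["center"]))

def relative_relation(a, b):
    """Relative relation: a crude left/right test on the camera x-axis."""
    return "left of" if nodes[a]["center"][0] < nodes[b]["center"][0] else "right of"

# Edges carry both relation types between a pair of nodes.
edges = [(0, 1, relative_relation(0, 1), metric_relation(0, 1))]
```

With the assumed intrinsics, the chair lifts to roughly (0, 0.07, 2.0) m and the table to (0.67, 0.04, 2.5) m, so the edge reads "chair left of table, about 0.83 m apart".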
SpatialRGPT's VLM architecture includes a visual encoder to encode vision features, a region-feature extractor to obtain region-level embeddings, linear connectors to project multi-modal embeddings into the word-embedding space, and a large language model (LLaMA2-7B) for language processing. Depth information is incorporated through a plugin module: the depth connector's weights are initialized from the RGB connector and trained only on spatial-related QAs. This flexible design allows the 2D visual encoder to leverage the additional depth representation while still functioning when depth inputs are not present.
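The dual-connector design can be sketched as follows. This is a minimal illustration, assuming mean pooling for the region-feature extractor and plain linear connectors; the dimensions, function names, and the concatenation of RGB and depth tokens are hypothetical, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT = 64, 128  # assumed visual-feature and word-embedding sizes

def extract_region_feature(feat_map, mask):
    """Region-feature extractor: mean-pool features inside a boolean mask.

    feat_map: (H, W, D_VIS) array; mask: (H, W) boolean array.
    """
    return feat_map[mask].mean(axis=0)

# Linear connectors projecting region features into the word-embedding space.
W_rgb = rng.normal(size=(D_VIS, D_TXT)) * 0.02
W_depth = W_rgb.copy()  # depth connector initialized from the RGB connector

def embed_region(rgb_feats, depth_feats, mask):
    """Produce language-space token(s) for one region; depth is optional."""
    rgb_tok = extract_region_feature(rgb_feats, mask) @ W_rgb
    if depth_feats is None:  # plugin design: still works without depth input
        return rgb_tok
    depth_tok = extract_region_feature(depth_feats, mask) @ W_depth
    return np.concatenate([rgb_tok, depth_tok])
```

Initializing `W_depth` from `W_rgb` mirrors the weight-initialization described above: the depth branch starts from a projection that already maps encoder features into the word-embedding space, and only diverges as it trains on spatial QAs.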
The training and inference paradigm of SpatialRGPT includes three stages: (i) Connector Feature Alignment, (ii) Visual Language Pre-training, and (iii) Visual Instruction-tuning. During the first stage, the connectors are trained to align the visual features with the language model's word-embedding space.
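The three-stage paradigm can be pictured as a per-stage schedule of which modules are trainable. The specific freeze/unfreeze choices below are assumptions for illustration only; the source does not state which components are frozen in each stage.

```python
# Hypothetical per-stage trainable-module schedule for the three stages.
STAGES = {
    "connector_feature_alignment": {"visual_encoder": False, "connectors": True, "llm": False},
    "visual_language_pretraining": {"visual_encoder": False, "connectors": True, "llm": True},
    "visual_instruction_tuning":   {"visual_encoder": False, "connectors": True, "llm": True},
}

def trainable_modules(stage):
    """Return the modules that receive gradient updates in a given stage."""
    return [name for name, on in STAGES[stage].items() if on]
```

Under these assumptions, only the connectors train during alignment, while later stages also update the language model.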