15 Oct 2024 | An-Chieh Cheng¹, Hongxu Yin², Yang Fu¹, Qiushan Guo², Ruihan Yang¹, Jan Kautz², Xiaolong Wang¹², Sifei Liu²
SpatialRGPT is a novel framework designed to enhance the spatial reasoning capabilities of Vision Language Models (VLMs). It introduces a region representation module and a flexible plugin for depth information, enabling VLMs to effectively perceive spatial arrangements at both local and global scopes. The framework uses a data curation pipeline to learn 3D spatial knowledge from scene graphs and provides a comprehensive benchmark, SpatialRGPT-Bench, for evaluating spatial cognition across diverse environments. The results demonstrate significant improvements in spatial reasoning tasks, showcasing the model's ability to reason about complex spatial relations and to serve as a dense reward annotator for robotic applications.
SpatialRGPT advances VLMs' spatial understanding through two key innovations: (i) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (ii) a flexible "plugin" module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, SpatialRGPT-Bench is a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks.
The framework includes a 3D scene graph construction pipeline that begins with a filtering process to remove unsuitable images. Candidate objects are then identified and grounded using open-vocabulary models, and lifted into 3D space via metric depth estimation and camera calibration. The resulting point clouds are processed to construct the final 3D scene graph: a collection of tuples in which nodes represent specific 3D object instances and edges represent the spatial relationships between them. Each node is defined by the object's class and its width and height in metric scale; each edge encodes one of two types of relations, relative or metric.
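The lifting and graph-construction steps above can be sketched in a few lines. This is a minimal illustration under assumed conventions, not the paper's implementation: the helper `lift_to_3d`, the intrinsics values, the object sizes, and the simple left/right heuristic for relative relations are all hypothetical choices made for the example.

```python
import numpy as np

def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Toy scene-graph nodes: each stores the object's class, metric width/height,
# and a 3D center obtained by lifting a 2D detection with its depth.
nodes = {
    0: {"class": "chair", "width": 0.5, "height": 0.9,
        "center": lift_to_3d(320, 260, 2.0, fx=600, fy=600, cx=320, cy=240)},
    1: {"class": "table", "width": 1.2, "height": 0.75,
        "center": lift_to_3d(480, 250, 2.5, fx=600, fy=600, cx=320, cy=240)},
}

def metric_relation(a, b):
    """Metric relation: Euclidean distance between node centers (meters)."""
    return float(np.linalg.norm(nodes[a]["center"] - nodes[b]["center"]))

def relative_relation(a, b):
    """Relative relation: a crude left/right test on the camera x-axis."""
    return "left of" if nodes[a]["center"][0] < nodes[b]["center"][0] else "right of"

# Edges carry both relation types between a pair of nodes.
edges = [(0, 1, relative_relation(0, 1), metric_relation(0, 1))]
```

With the assumed intrinsics, the chair lifts to roughly (0, 0.07, 2.0) m and the table to (0.67, 0.04, 2.5) m, so the edge reads "chair left of table, about 0.83 m apart".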
SpatialRGPT's VLM architecture includes a visual encoder to encode vision features, a region-feature extractor to obtain region-level embeddings, linear connectors to project multi-modal embeddings into the word-embedding space, and a large language model (LLaMA2-7B) for language processing. Depth information is incorporated through a plugin module: the depth connector's weights are initialized from the RGB connector and trained only on spatial-related QAs. This flexible design allows the 2D visual encoder to leverage the additional depth representation while still functioning when depth inputs are not present.
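The dual-connector design can be sketched as follows. This is a minimal illustration, assuming mean pooling for the region-feature extractor and plain linear connectors; the dimensions, function names, and the concatenation of RGB and depth tokens are hypothetical, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT = 64, 128  # assumed visual-feature and word-embedding sizes

def extract_region_feature(feat_map, mask):
    """Region-feature extractor: mean-pool features inside a boolean mask.

    feat_map: (H, W, D_VIS) array; mask: (H, W) boolean array.
    """
    return feat_map[mask].mean(axis=0)

# Linear connectors projecting region features into the word-embedding space.
W_rgb = rng.normal(size=(D_VIS, D_TXT)) * 0.02
W_depth = W_rgb.copy()  # depth connector initialized from the RGB connector

def embed_region(rgb_feats, depth_feats, mask):
    """Produce language-space token(s) for one region; depth is optional."""
    rgb_tok = extract_region_feature(rgb_feats, mask) @ W_rgb
    if depth_feats is None:  # plugin design: still works without depth input
        return rgb_tok
    depth_tok = extract_region_feature(depth_feats, mask) @ W_depth
    return np.concatenate([rgb_tok, depth_tok])
```

Initializing `W_depth` from `W_rgb` mirrors the weight-initialization described above: the depth branch starts from a projection that already maps encoder features into the word-embedding space, and only diverges as it trains on spatial QAs.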
The training and inference paradigm of SpatialRGPT includes three stages: (i) Connector Feature Alignment, (ii) Visual Language Pre-training, and (iii) Visual Instruction-tuning. During the first stage, the connectors are trained to align the visual features with the language model's word-embedding space.
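The three-stage paradigm can be pictured as a per-stage schedule of which modules are trainable. The specific freeze/unfreeze choices below are assumptions for illustration only; the source does not state which components are frozen in each stage.

```python
# Hypothetical per-stage trainable-module schedule for the three stages.
STAGES = {
    "connector_feature_alignment": {"visual_encoder": False, "connectors": True, "llm": False},
    "visual_language_pretraining": {"visual_encoder": False, "connectors": True, "llm": True},
    "visual_instruction_tuning":   {"visual_encoder": False, "connectors": True, "llm": True},
}

def trainable_modules(stage):
    """Return the modules that receive gradient updates in a given stage."""
    return [name for name, on in STAGES[stage].items() if on]
```

Under these assumptions, only the connectors train during alignment, while later stages also update the language model.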