SpatialBot: Precise Spatial Understanding with Vision Language Models

19 Mar 2025 | Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao
The paper "SpatialBot: Precise Spatial Understanding with Vision Language Models" introduces SpatialBot, a model designed to enhance spatial understanding by utilizing both RGB and depth images. The authors address the challenges of depth perception in VLMs by introducing the SpatialQA and SpatialQA-E datasets, which include multi-level depth-related questions and tasks. They also develop SpatialBench to evaluate VLMs' spatial understanding capabilities. Extensive experiments demonstrate that SpatialBot significantly improves spatial understanding and performance on general VLM benchmarks and embodied AI tasks. The model, code, and datasets are available online. Key contributions include the proposal of SpatialBot, the creation of the SpatialQA and SpatialQA-E datasets, and the evaluation of SpatialBot's performance on various benchmarks and tasks.The paper "SpatialBot: Precise Spatial Understanding with Vision Language Models" introduces SpatialBot, a model designed to enhance spatial understanding by utilizing both RGB and depth images. The authors address the challenges of depth perception in VLMs by introducing the SpatialQA and SpatialQA-E datasets, which include multi-level depth-related questions and tasks. They also develop SpatialBench to evaluate VLMs' spatial understanding capabilities. Extensive experiments demonstrate that SpatialBot significantly improves spatial understanding and performance on general VLM benchmarks and embodied AI tasks. The model, code, and datasets are available online. Key contributions include the proposal of SpatialBot, the creation of the SpatialQA and SpatialQA-E datasets, and the evaluation of SpatialBot's performance on various benchmarks and tasks.