3 Jun 2024 | Abdelrhman Werby*, Chenguang Huang*, Martin Büchner*, Abhinav Valada*, Wolfram Burgard*
This paper introduces HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded indoor robot navigation. The method leverages open-vocabulary vision foundation models to build state-of-the-art open-vocabulary 3D segment-level maps and constructs a scene graph hierarchy of floor, room, and object concepts, each enriched with open-vocabulary features. A cross-floor Voronoi graph enables robotic traversal of multi-story buildings. Evaluated on three distinct datasets, including Replica and ScanNet, HOV-SG surpasses previous baselines in open-vocabulary semantic accuracy at the object, room, and floor level, outperforms existing methods in 3D semantic segmentation, and reduces representation size by 75% compared to dense open-vocabulary maps. The paper also presents AUC_top-k, a novel evaluation metric for open-vocabulary semantics, and demonstrates hierarchical concept retrieval as well as successful long-horizon, language-conditioned robot navigation in real-world multi-story environments. The code and evaluation protocol are publicly available at https://hovsg.github.io.
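To make the floor-room-object hierarchy concrete, the sketch below shows one plausible way such a scene graph could be represented and queried with a language embedding. The class names, the coarse-to-fine argmax query, and the use of cosine similarity against a text feature (e.g., from a CLIP-style encoder) are illustrative assumptions for this summary, not the authors' implementation.

```python
from dataclasses import dataclass, field
import numpy as np

# Hypothetical node types: each level carries an open-vocabulary feature vector.
@dataclass
class ObjectNode:
    name: str                 # e.g. "potted plant" (illustrative label)
    feature: np.ndarray       # open-vocabulary embedding of the object segment

@dataclass
class RoomNode:
    name: str
    feature: np.ndarray
    objects: list = field(default_factory=list)

@dataclass
class FloorNode:
    name: str
    feature: np.ndarray
    rooms: list = field(default_factory=list)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def query_hierarchy(floors: list, text_feature: np.ndarray):
    """Resolve a language query coarse-to-fine: floor -> room -> object.

    At each level, the child whose open-vocabulary feature is most similar
    to the query embedding is selected (an assumed, simplified strategy).
    """
    floor = max(floors, key=lambda f: cosine(f.feature, text_feature))
    room = max(floor.rooms, key=lambda r: cosine(r.feature, text_feature))
    obj = max(room.objects, key=lambda o: cosine(o.feature, text_feature))
    return floor, room, obj
```

In such a scheme, a query like "the plant in the kitchen on the second floor" would first select the most similar floor node, then the most similar room within it, and finally the most similar object, after which the cross-floor Voronoi graph would supply the navigation path.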
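The abstract names the AUC_top-k metric without defining it here. Purely as a hedged illustration, the sketch below assumes it denotes the normalized area under the top-k retrieval-accuracy curve, i.e., top-k accuracy averaged over k = 1..K; the paper's exact definition may differ.

```python
import numpy as np

def auc_top_k(similarities: np.ndarray, gt_index: np.ndarray, max_k: int) -> float:
    """Assumed AUC_top-k: mean of top-k accuracies for k = 1..max_k.

    similarities: (N, C) similarity of each query to every candidate label.
    gt_index:     (N,)   ground-truth label index for each query.
    """
    # Rank candidate labels per query from most to least similar.
    ranking = np.argsort(-similarities, axis=1)
    accs = []
    for k in range(1, max_k + 1):
        # A query counts as a hit if the ground truth is among its top-k labels.
        hits = (ranking[:, :k] == gt_index[:, None]).any(axis=1)
        accs.append(hits.mean())
    return float(np.mean(accs))
```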