28 Mar 2024 | Yinan Deng, Jiahui Wang, Jingyu Zhao, Xinyu Tian, Guangyan Chen, Yi Yang, Yufeng Yue
OpenGraph is a novel framework for open-vocabulary hierarchical 3D graph representation in large-scale outdoor environments. It enables various downstream tasks, including zero-shot semantic segmentation, open-vocabulary object retrieval, structured topology query, global path planning, and interactive map updating. The framework first extracts instances and their captions from visual images and encodes the captions to strengthen textual reasoning. It then performs incremental, object-centric 3D mapping with feature embedding by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results on the SemanticKITTI dataset show that OpenGraph achieves the highest segmentation and query accuracy.

OpenGraph leverages visual language models (VLMs) and large language models (LLMs) to enhance object comprehension and reasoning. It introduces a hierarchical graph representation that supports efficient maintenance and rapid retrieval in large-scale environments. The framework includes three main modules: Caption-Enhanced Object Comprehension, Object-Centric Map Construction, and Hierarchical Graph Representation Formation. The Caption-Enhanced Object Comprehension module uses VLMs to extract object captions and LLMs to encode them for enhanced reasoning. The Object-Centric Map Construction module projects 2D images onto 3D LiDAR point clouds to build object-centric maps. The Hierarchical Graph Representation Formation module segments the environment based on lane graph connectivity to construct a hierarchical graph.

OpenGraph's hierarchical graph representation enables efficient structured queries and facilitates human-interactive map updating. Experimental results show that OpenGraph achieves accurate zero-shot semantic understanding and superior performance in open-vocabulary object retrieval and hierarchical graph structured queries.
The framework demonstrates superior natural language reasoning in various outdoor object retrieval tasks.
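
The image-to-LiDAR projection step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with a known intrinsic matrix `K` and a known LiDAR-to-camera transform `T_cam_lidar`, and the function names `project_lidar_to_image` and `label_points` are hypothetical. Each LiDAR point that lands inside the image takes the instance ID of the pixel it falls on, which is one simple way to lift 2D instance masks onto 3D points.

```python
import numpy as np


def project_lidar_to_image(points_lidar, T_cam_lidar, K, image_shape):
    """Project (N, 3) LiDAR points into the image (pinhole model, assumed).

    Returns integer pixel columns u, rows v, and a boolean mask that is True
    for points in front of the camera and inside the image bounds.
    """
    n = points_lidar.shape[0]
    homogeneous = np.hstack([points_lidar, np.ones((n, 1))])
    points_cam = (T_cam_lidar @ homogeneous.T).T[:, :3]

    # Points at or behind the image plane cannot be observed.
    in_front = points_cam[:, 2] > 1e-6
    # Use a dummy depth of 1 for those points so the division stays finite.
    depth = np.where(in_front, points_cam[:, 2], 1.0)

    uvw = (K @ points_cam.T).T
    u = np.floor(uvw[:, 0] / depth).astype(int)
    v = np.floor(uvw[:, 1] / depth).astype(int)

    height, width = image_shape
    visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return u, v, visible


def label_points(points_lidar, instance_mask, T_cam_lidar, K, background=-1):
    """Give each visible point the instance ID of the pixel it projects to."""
    u, v, visible = project_lidar_to_image(
        points_lidar, T_cam_lidar, K, instance_mask.shape
    )
    labels = np.full(points_lidar.shape[0], background, dtype=int)
    labels[visible] = instance_mask[v[visible], u[visible]]
    return labels


# Toy setup (all values are illustrative assumptions, not dataset calibration).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
T_cam_lidar = np.eye(4)  # pretend the LiDAR and camera frames coincide

points = np.array([[0.0, 0.0, 5.0],    # projects to pixel (u=32, v=24)
                   [1.0, 0.0, 5.0],    # projects to pixel (u=52, v=24)
                   [0.0, 0.0, -5.0],   # behind the camera
                   [10.0, 0.0, 5.0]])  # outside the image on the right

instance_mask = np.zeros((48, 64), dtype=int)
instance_mask[:, 40:] = 7  # instance 7 covers the right part of the image

print(label_points(points, instance_mask, T_cam_lidar, K))
```

In an incremental mapping setting, per-frame labels like these would still need to be associated and fused across frames; that step is outside this sketch.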