9 May 2024 | Sourav Garg, Krishan Rana, Mehdi Hosseinzadeh, Lachlan Mares, Niko Sünderhauf, Feras Dayoub, Ian Reid
RoboHop introduces a novel topological map representation for open-world visual navigation that uses image segments as nodes. Unlike traditional pixel-level features, segments are semantically meaningful and open-vocabulary queryable, offering a richer and more flexible map. The method builds a purely topological graph: edges connect segments across consecutive images via segment-level descriptor matching, and within an image via pixel-centroid proximity. The inter-image persistence of segments, combined with their intra-image neighbors, yields a continuous representation of place.

Segment-level descriptors are refined with graph convolution layers, which improves robot localization through segment-level retrieval. Navigation plans take the form of 'hops' over segments, and objects can be searched for via natural language queries. By leveraging foundation models such as SAM for segmentation and DINO for data association, the system achieves zero-shot real-world navigation without requiring 3D maps or learned policies.

Evaluated on real-world data, the method shows improved performance in segment-level data association, topological localization, and planning, and comparison with existing methods highlights its advantages in semantic expressivity and open-vocabulary querying. The paper also discusses limitations: dependence on the quality of segment-level data association, inability to handle dynamic environments, and difficulty with relational queries. Future work includes integrating visual servoing for real-time feedback and adding metric information for finer granularity.
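To make the graph construction concrete, here is a minimal sketch of the segment-graph idea described above: nodes are image segments carrying a descriptor and a pixel centroid, intra-image edges link segments with nearby centroids, inter-image edges link segments in consecutive images with similar descriptors, and a navigation plan is a sequence of hops along the graph. The function names, thresholds, and data layout are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a RoboHop-style segment graph. Assumes each frame is
# a list of (descriptor, centroid) pairs, e.g. from SAM masks with DINO
# features; `intra_dist` and `match_thresh` are made-up tuning parameters.
import numpy as np
import networkx as nx


def build_segment_graph(frames, intra_dist=100.0, match_thresh=0.8):
    """frames: list (per image) of lists of (descriptor, centroid) pairs."""
    G = nx.Graph()
    for t, segments in enumerate(frames):
        for i, (desc, centroid) in enumerate(segments):
            # Node id (t, i): segment i of image t.
            G.add_node((t, i), desc=np.asarray(desc, float),
                       centroid=np.asarray(centroid, float))
        # Intra-image edges: segments whose pixel centroids are close.
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                ci = np.asarray(segments[i][1], float)
                cj = np.asarray(segments[j][1], float)
                if np.linalg.norm(ci - cj) < intra_dist:
                    G.add_edge((t, i), (t, j))
        # Inter-image edges: descriptor matches against the previous frame.
        if t > 0:
            for i, (di, _) in enumerate(frames[t - 1]):
                for j, (dj, _) in enumerate(segments):
                    di_v = np.asarray(di, float)
                    dj_v = np.asarray(dj, float)
                    sim = di_v @ dj_v / (
                        np.linalg.norm(di_v) * np.linalg.norm(dj_v))
                    if sim > match_thresh:
                        G.add_edge((t - 1, i), (t, j))
    return G


def plan_hops(G, start, goal):
    """A navigation plan as a sequence of segment 'hops' (shortest path)."""
    return nx.shortest_path(G, source=start, target=goal)
```

For example, a segment persisting across two frames creates an inter-image edge, so a plan from a neighboring segment in frame 0 to the matched segment in frame 1 hops through it; in the full system, language-based object search would select the goal node instead of it being given explicitly.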