Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation

31 Jul 2024 | Daniel Honerkamp, Martin Büchner, Fabien Despinoy, Tim Welschehold, Abhinav Valada
This paper introduces MoMa-LLM, a novel approach that integrates large language models (LLMs) with dynamic, open-vocabulary scene graphs for interactive object search in large, unexplored environments. The method constructs scene graphs from dense maps and Voronoi graphs and updates them dynamically as the environment is explored. These representations are tightly interwoven with an object-centric action space, enabling the LLM to issue high-level commands that are executed by low-level subpolicies. The approach is zero-shot, open-vocabulary, and scalable to a variety of mobile manipulation and household robotic tasks.

The paper presents a semantic interactive search task in large, realistic indoor environments: an agent must find a target object, which can require navigating through doors and searching inside cabinets and drawers. The task is challenging because it demands joint reasoning over manipulation and navigation skills in unexplored environments.
The authors also introduce a novel evaluation paradigm for object search based on full efficiency curves, which enables a more faithful comparison of methods than traditional time-based point metrics. Evaluations in both simulation and the real world demonstrate significantly improved search efficiency over conventional baselines and state-of-the-art methods: MoMa-LLM outperforms other approaches in success rate, search efficiency, and the ability to handle abstract tasks. The method is further shown to be robust across room layouts and capable of open-vocabulary room classification.

The paper concludes that MoMa-LLM provides a scalable and efficient solution for interactive object search in large, unexplored environments, leveraging LLMs and dynamic scene graphs for effective reasoning and planning. The approach applies to a wide range of mobile manipulation and household robotic tasks and shows potential for extension to more complex, abstract tasks.