Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

12 Jul 2024 | Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, Jie Tan
Mobility VLA is a hierarchical navigation policy that combines long-context Vision-Language Models (VLMs) with topological graphs to address multimodal instruction navigation (MINT). The system takes a demonstration tour video of the environment together with a multimodal user instruction (text and/or image) and navigates the robot to the requested location. The high-level policy prompts a long-context VLM with the full tour video and the instruction to identify the goal frame in the tour; the low-level policy then uses a topological graph built offline from the tour frames to generate waypoint actions toward that goal.

Mobility VLA was evaluated in a real-world 836 m² office and in a home-like environment, achieving 86% and 90% end-to-end success rates, respectively, on complex tasks involving multimodal instructions, outperforming alternative approaches. Deployment is lightweight: a user records a tour with a smartphone and can then issue instructions to the robot. The system's robustness stems from pairing long-context VLMs, which handle both text and image instructions over the entire tour, with topological graphs, which provide efficient and reliable low-level navigation even in challenging scenarios. Overall, Mobility VLA represents a significant step toward more intuitive and effective human-robot interaction in navigation.
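To make the hierarchical structure concrete, below is a minimal Python sketch of the two-level pipeline as described in the summary. The class and function names (TourFrame, query_vlm, build_topological_graph, navigate) and the use of networkx for the graph are illustrative assumptions, not the authors' actual implementation.

```python
"""Minimal sketch of a Mobility-VLA-style hierarchical policy.
All names here are hypothetical; only the structure (VLM goal-frame
selection + topological-graph path planning) follows the paper's description."""
from dataclasses import dataclass
import networkx as nx  # topological graph and shortest-path search


@dataclass
class TourFrame:
    index: int    # position of the frame in the demonstration tour video
    image: bytes  # raw frame pixels (placeholder type)
    pose: tuple   # (x, y, yaw) estimated for the frame, e.g. from odometry


def query_vlm(tour_frames, instruction_text, instruction_image=None):
    """High-level policy (assumed interface): ask a long-context VLM which
    tour frame best satisfies the multimodal instruction; return its index.
    In the paper this is a single prompt containing the tour video plus the
    user's text (and optional image); here the call is stubbed out."""
    raise NotImplementedError("plug in a long-context VLM client here")


def build_topological_graph(tour_frames, edge_dist=2.0):
    """Low-level map: connect tour frames whose estimated poses are close,
    yielding an offline topological graph with frames as nodes."""
    graph = nx.Graph()
    for frame in tour_frames:
        graph.add_node(frame.index, pose=frame.pose)
    for a in tour_frames:
        for b in tour_frames:
            if a.index < b.index:
                dx = a.pose[0] - b.pose[0]
                dy = a.pose[1] - b.pose[1]
                if (dx * dx + dy * dy) ** 0.5 <= edge_dist:
                    graph.add_edge(a.index, b.index)
    return graph


def navigate(tour_frames, graph, current_node, instruction_text, instruction_image=None):
    """Hierarchical policy: the VLM picks the goal frame, the graph yields a
    path of tour frames, and each hop becomes a waypoint for the robot's
    local controller."""
    goal_node = query_vlm(tour_frames, instruction_text, instruction_image)
    path = nx.shortest_path(graph, source=current_node, target=goal_node)
    return [graph.nodes[n]["pose"] for n in path]
```

In this sketch, the expensive VLM call happens only once per instruction (to pick the goal frame), while the per-step navigation reduces to cheap graph search over the tour, which mirrors the division of labor the summary attributes to the system.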