Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

12 Jul 2024 | Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, Jie Tan
Mobility VLA is a hierarchical navigation policy that combines long-context Vision-Language Models (VLMs) with topological graphs to address multimodal instruction navigation (MINT). The system takes a demonstration tour video of the environment together with a multimodal user instruction (text and/or image) and navigates the robot to the requested location. The high-level policy prompts a long-context VLM with the full tour video and the instruction to identify the goal frame in the tour; the low-level policy then uses a topological graph built offline from the tour frames to generate waypoint actions toward that goal.

Mobility VLA was evaluated in a real-world 836 m² office and in a home-like environment, achieving 86% and 90% end-to-end success rates, respectively, on complex tasks involving multimodal instructions, outperforming alternative approaches. Deployment is lightweight: a user records a tour with a smartphone and can then issue instructions to the robot. The system's robustness stems from pairing long-context VLMs, which handle both text and image instructions over the entire tour, with topological graphs, which provide efficient and reliable low-level navigation even in challenging scenarios. Overall, Mobility VLA represents a significant step toward more intuitive and effective human-robot interaction in navigation.
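To make the hierarchical structure concrete, below is a minimal Python sketch of the two-level pipeline as described in the summary. The class and function names (TourFrame, query_vlm, build_topological_graph, navigate) and the use of networkx for the graph are illustrative assumptions, not the authors' actual implementation.

```python
"""Minimal sketch of a Mobility-VLA-style hierarchical policy.
All names here are hypothetical; only the structure (VLM goal-frame
selection + topological-graph path planning) follows the paper's description."""
from dataclasses import dataclass
import networkx as nx  # topological graph and shortest-path search


@dataclass
class TourFrame:
    index: int    # position of the frame in the demonstration tour video
    image: bytes  # raw frame pixels (placeholder type)
    pose: tuple   # (x, y, yaw) estimated for the frame, e.g. from odometry


def query_vlm(tour_frames, instruction_text, instruction_image=None):
    """High-level policy (assumed interface): ask a long-context VLM which
    tour frame best satisfies the multimodal instruction; return its index.
    In the paper this is a single prompt containing the tour video plus the
    user's text (and optional image); here the call is stubbed out."""
    raise NotImplementedError("plug in a long-context VLM client here")


def build_topological_graph(tour_frames, edge_dist=2.0):
    """Low-level map: connect tour frames whose estimated poses are close,
    yielding an offline topological graph with frames as nodes."""
    graph = nx.Graph()
    for frame in tour_frames:
        graph.add_node(frame.index, pose=frame.pose)
    for a in tour_frames:
        for b in tour_frames:
            if a.index < b.index:
                dx = a.pose[0] - b.pose[0]
                dy = a.pose[1] - b.pose[1]
                if (dx * dx + dy * dy) ** 0.5 <= edge_dist:
                    graph.add_edge(a.index, b.index)
    return graph


def navigate(tour_frames, graph, current_node, instruction_text, instruction_image=None):
    """Hierarchical policy: the VLM picks the goal frame, the graph yields a
    path of tour frames, and each hop becomes a waypoint for the robot's
    local controller."""
    goal_node = query_vlm(tour_frames, instruction_text, instruction_image)
    path = nx.shortest_path(graph, source=current_node, target=goal_node)
    return [graph.nodes[n]["pose"] for n in path]
```

In this sketch, the expensive VLM call happens only once per instruction (to pick the goal frame), while the per-step navigation reduces to cheap graph search over the tour, which mirrors the division of labor the summary attributes to the system.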