NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

17 Jul 2024 | Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu
NavGPT-2 is a system designed to bring navigational reasoning to large vision-language models (VLMs) for vision-and-language navigation (VLN) tasks. The paper addresses a gap in prior work on integrating large language models (LLMs) into VLN: zero-shot approaches achieve limited performance, while fine-tuned LLMs lose their general language capabilities. NavGPT-2 bridges this gap by combining a VLM with a navigation policy network, enabling effective navigation while preserving the interpretability of the LLM.

The system keeps the LLM frozen and uses it to generate navigational reasoning from visual observations and instructions, which in turn guides navigation decisions; a minimal sketch of this candidate-scoring pattern appears below. For action planning, NavGPT-2 employs a topological graph-based navigation policy that plans over the explored environment and backtracks when necessary (see the graph-policy sketch after the first example). The model is trained in multiple stages, incorporating visual instruction tuning and navigational reasoning data generated from the R2R dataset.

NavGPT-2 outperforms existing methods on navigation metrics such as success rate and path length, and it is evaluated on multiple datasets, showing strong generalization across different environments and instruction formats. Because the frozen LLM verbalizes its reasoning, users can inspect the agent's decision-making process. The results indicate that using a VLM as the visual-linguistic representation yields better alignment between vision, language, and action, making NavGPT-2 a scalable and efficient solution for VLN that combines the strengths of LLMs and VLMs.
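To make the architecture concrete, here is a minimal sketch of the pattern the paper describes at a high level: latents from a frozen VLM are passed to a small trainable head that scores candidate viewpoints as actions. The module names, dimensions, and activation below are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Hedged sketch: a lightweight policy head on top of frozen-VLM latents.

    The VLM (vision encoder + frozen LLM) is assumed to emit one latent
    vector per candidate viewpoint; only this head would be trained.
    Dimensions (768 -> 512) are illustrative assumptions.
    """

    def __init__(self, vlm_dim=768, hidden=512):
        super().__init__()
        self.proj = nn.Linear(vlm_dim, hidden)  # project frozen-VLM latents
        self.score = nn.Linear(hidden, 1)       # one action logit per candidate

    def forward(self, vlm_latents):
        # vlm_latents: (num_candidates, vlm_dim) from the frozen VLM
        h = torch.relu(self.proj(vlm_latents))
        return self.score(h).squeeze(-1)        # (num_candidates,) logits


scorer = CandidateScorer()
logits = scorer(torch.randn(4, 768))  # 4 candidate viewpoints -> 4 logits
```

In the paper's design, the LLM itself stays frozen, which is what preserves its general language ability; only lightweight components adapt to the navigation task.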
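The topological-graph policy with backtracking can likewise be sketched. This is a generic graph-exploration skeleton under assumed interfaces (node ids, an external `scores` dict), not NavGPT-2's policy network; in the paper, the candidate scores would come from the learned components above.

```python
from collections import deque


class TopoGraphPolicy:
    """Hedged sketch of a topological-graph navigation policy with backtracking.

    The agent grows a graph of viewpoints as it moves; to reach a promising
    unvisited node that is not adjacent, it backtracks through visited nodes.
    """

    def __init__(self):
        self.adj = {}         # node -> set of neighboring nodes
        self.visited = set()  # nodes the agent has already stood on

    def observe(self, node, neighbors):
        """Record the current node and its newly observed neighbors."""
        self.adj.setdefault(node, set()).update(neighbors)
        for n in neighbors:
            self.adj.setdefault(n, set()).add(node)
        self.visited.add(node)

    def shortest_path(self, start, goal):
        """BFS over the explored graph; this is the backtracking route."""
        queue, parent = deque([start]), {start: None}
        while queue:
            cur = queue.popleft()
            if cur == goal:
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                return path[::-1]
            for nxt in self.adj.get(cur, ()):
                if nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        return None

    def plan(self, current, scores):
        """Move toward the highest-scoring unvisited node, backtracking if needed."""
        frontier = [n for n in self.adj if n not in self.visited]
        if not frontier:
            return [current]  # nothing left to explore: stop
        target = max(frontier, key=lambda n: scores.get(n, float("-inf")))
        return self.shortest_path(current, target)


policy = TopoGraphPolicy()
policy.observe("A", ["B", "C"])
policy.observe("B", ["A", "D"])
# scores would come from the policy head over VLM latents; made up here:
print(policy.plan("B", {"C": 0.8, "D": 0.3}))  # -> ['B', 'A', 'C'] (backtracks via A)
```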