NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

17 Jul 2024 | Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu
NavGPT-2 is a system designed to bridge the gap between agents based on large language models (LLMs) and specialized vision-and-language navigation (VLN) agents. The authors address the performance gap between LLM-based agents and state-of-the-art VLN specialists, which they attribute to the underutilization of LLMs' interpretative capabilities in navigation tasks. By integrating visual content into a frozen LLM, NavGPT-2 enables the model to understand visual observations and incorporate them into navigation policy networks. The system uses GPT-4V to generate navigational reasoning data and performs visual instruction tuning on that data. A topological graph-based navigation policy then supports both action prediction and navigational reasoning. According to the authors, the method is data-efficient and eliminates the performance gap between LLM-based agents and VLN specialists while preserving the interpretative abilities of LLMs.
The source code is available at <https://github.com/GengzeZhou/NavGPT-2>.
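
The idea of a topological graph-based navigation policy can be illustrated with a toy sketch. This is not the NavGPT-2 implementation: the names `TopoMap` and `choose_action` are hypothetical, and the hand-written scores stand in for the output of a learned policy network. The sketch assumes only that the agent keeps a graph of visited and observed viewpoints and, at each step, either moves to an observed-but-unvisited node or stops.

```python
from dataclasses import dataclass, field


@dataclass
class TopoMap:
    """Toy topological map: nodes are viewpoints, edges are traversable links."""
    edges: dict = field(default_factory=dict)   # node -> set of neighbouring nodes
    visited: set = field(default_factory=set)

    def observe(self, current, neighbors):
        """Mark `current` as visited and add its observed neighbours to the graph."""
        self.visited.add(current)
        self.edges.setdefault(current, set()).update(neighbors)
        for n in neighbors:
            self.edges.setdefault(n, set()).add(current)

    def candidates(self):
        """All nodes seen so far but not yet visited (a global action space)."""
        return sorted(n for n in self.edges if n not in self.visited)


def choose_action(topo, scores, stop_score):
    """Return the best-scoring unvisited node, or "STOP" if stopping scores at least as high.

    `scores` is a hypothetical stand-in for a learned policy network's output.
    """
    best = max(
        topo.candidates(),
        key=lambda n: scores.get(n, float("-inf")),
        default=None,
    )
    if best is None or stop_score >= scores.get(best, float("-inf")):
        return "STOP"
    return best


# Usage: the agent starts at A, sees B and C, moves to B, and sees D.
topo = TopoMap()
topo.observe("A", ["B", "C"])
topo.observe("B", ["A", "D"])
action = choose_action(topo, {"C": 0.2, "D": 0.7}, stop_score=0.1)
```

Because the candidates include every unvisited node in the graph rather than only the current neighbours, a policy of this shape can backtrack to an earlier branch (here, `C`) when that node scores highest.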