20 Jun 2024 | Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, Kwan-Yee K. Wong
**MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation**
**Authors:** Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, Kwan-Yee K. Wong
**Project:** <https://chen-judge.github.io/MapGPT/>
**Abstract:**
Embodied agents equipped with GPT have demonstrated exceptional decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" to understand the overall environment. This work introduces MapGPT, a novel map-guided GPT-based agent that encourages global exploration by building an online linguistic-formed map. The map includes node information and topological relationships, helping GPT understand the spatial environment. An adaptive planning mechanism is proposed to assist the agent in performing multi-step path planning based on the map, enabling systematic exploration of multiple candidate nodes or sub-goals. Extensive experiments show that MapGPT achieves state-of-the-art zero-shot performance on R2R and REVERIE datasets, demonstrating enhanced global thinking and path planning abilities of GPT.
**Contributions:**
- Introduce a map-guided prompting method to build an online linguistic-formed map, including node information and topological relationships.
- Propose an adaptive planning mechanism to enable multi-step path planning based on the map.
- Achieve state-of-the-art zero-shot performance on R2R and REVERIE datasets, showcasing improved global thinking and path planning capabilities.
**Methods:**
- **Single Expert Prompt System:** A unified prompt system that integrates task description, fundamental inputs (instruction, visual observation, action space, history), and map annotations.
- **Map-Guided Prompting:** Convert topological relationships into textual prompts to help GPT understand the navigation environment.
- **Adaptive Path Planning:** Enable multi-step path planning by dynamically generating and updating a plan based on the map and previous planning.
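As a rough illustration of the map-guided prompting and plan carry-over described above, the following minimal sketch keeps an online topological map and renders it as text for the prompt. The `TopoMap` class, its method names, and the exact prompt wording are hypothetical and are not taken from the MapGPT code. The call to GPT is omitted, and `previous_plan` simply stands for whatever plan text the model produced at the previous step.

```python
class TopoMap:
    """Hypothetical online topological map: nodes are places, edges are observed links."""

    def __init__(self):
        self.edges = {}    # place id -> list of connected place ids (discovery order)
        self.visited = []  # places visited so far, in order (repeats allowed)

    def update(self, current, neighbours):
        """Record a visit to `current` and the navigable places observed from it."""
        self.visited.append(current)
        links = self.edges.setdefault(current, [])
        for n in neighbours:
            if n not in links:
                links.append(n)
            # Assumption: every observed connection can be traversed both ways.
            back = self.edges.setdefault(n, [])
            if current not in back:
                back.append(current)

    def to_prompt(self, previous_plan=None):
        """Render trajectory, connectivity, unvisited places and the last plan as text."""
        seen = list(dict.fromkeys(self.visited))  # unique places, order preserved
        lines = ["Trajectory: " + " -> ".join(f"Place {n}" for n in self.visited)]
        lines.append("Map:")
        for n in seen:
            linked = ", ".join(str(m) for m in self.edges[n])
            lines.append(f"  Place {n} is connected with Places {linked}")
        frontier = [n for n in self.edges if n not in seen]
        lines.append("Unvisited places: " + (", ".join(map(str, frontier)) or "none"))
        # The previous plan is fed back so the model can revise it rather than start over.
        lines.append("Previous plan: " + (previous_plan or "none"))
        return "\n".join(lines)


topo = TopoMap()
topo.update(0, [1, 2])   # at Place 0 the agent observes Places 1 and 2
topo.update(1, [0, 3])   # it moves to Place 1 and observes Places 0 and 3
prompt = topo.to_prompt(previous_plan="Check Place 1 first, then fall back to Place 2.")
print(prompt)
```

In this sketch the unvisited places stay in the prompt after the agent has moved on, which is what would let a model plan a multi-step route back to an earlier candidate instead of choosing only among the currently adjacent places.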
**Experiments:**
- **Datasets and Evaluation:** R2R and REVERIE datasets with detailed step-by-step and high-level instructions, respectively.
- **Results:** MapGPT outperforms previous zero-shot methods on both datasets, with significant improvements in success rate (SR) and oracle success rate (OSR).
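For readers unfamiliar with the two metrics, the sketch below shows how success rate and oracle success rate are commonly computed for R2R-style episodes. It rests on assumptions not stated in this summary: a 3 m success radius (the usual R2R convention) and straight-line distance as a stand-in for the geodesic distance used by the benchmark. The function names and toy episodes are illustrative only, and REVERIE defines success differently.

```python
import math

SUCCESS_RADIUS_M = 3.0  # assumed threshold, following the usual R2R convention


def success(path, goal, radius=SUCCESS_RADIUS_M):
    """True if the agent's final position lies within `radius` of the goal."""
    return math.dist(path[-1], goal) <= radius


def oracle_success(path, goal, radius=SUCCESS_RADIUS_M):
    """True if any position along the path lies within `radius` of the goal."""
    return any(math.dist(p, goal) <= radius for p in path)


def rates(episodes):
    """Return (success rate, oracle success rate) over (path, goal) pairs."""
    n = len(episodes)
    sr = sum(success(p, g) for p, g in episodes) / n
    osr = sum(oracle_success(p, g) for p, g in episodes) / n
    return sr, osr


# Toy episodes: (list of visited positions, goal position), all hypothetical.
episodes = [
    ([(0, 0), (2, 0)], (3, 0)),           # stops 1 m from the goal
    ([(0, 0), (5, 0), (10, 0)], (5, 0)),  # passes the goal, then overshoots
    ([(0, 0)], (10, 0)),                  # never gets close
]
sr, osr = rates(episodes)
print(f"SR = {sr:.2f}, OSR = {osr:.2f}")
```

The second toy episode counts toward the oracle success rate but not the success rate, so the gap between the two numbers reflects how often an agent reaches the goal area and then fails to stop there.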
**Conclusion:**
MapGPT is a novel zero-shot agent for VLN tasks, leveraging map-guided prompting and adaptive path planning to achieve state-of-the-art zero-shot performance. Future work could focus on real-world deployment and the challenges it raises.