2024 | Zhonghan Zhao*, Kewei Chen*, Dongxu Guo*, Wenhao Chai*, Tian Ye*, Yanting Zhang, and Gaoang Wang
The paper introduces HAS, a Hierarchical Auto-organizing System for multi-agent navigation in the Minecraft environment. HAS leverages large language models (LLMs) to let agents navigate complex, open-ended environments by autonomously organizing themselves and collaborating. Its hierarchical structure combines centralized planning with decentralized execution, so agents can dynamically adjust to tasks while staying in communication. The system also integrates multi-modal information, enabling agents to use visual, textual, and auditory data for navigation. HAS handles tasks such as searching and exploring, and achieves state-of-the-art performance on asynchronous multi-modal navigation tasks. An auto-organizing, intra-communication mechanism lets agents adapt to subtasks and collaborate efficiently, while a multi-modal memory stores and retrieves experiences to improve planning accuracy and consistency. A dynamic map supplies agents with real-time environmental information, improving navigation efficiency. Experimental results show that HAS outperforms baselines on navigation tasks, demonstrating its effectiveness in complex, open-ended environments, and the paper highlights its potential for long-horizon tasks and lifelong learning as a step forward for embodied AI.
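To make the centralized-planning / decentralized-execution idea concrete, here is a minimal Python sketch of that pattern, not the authors' implementation: the `Manager`, `Worker`, `SharedMemory` classes and their methods are illustrative assumptions, with the LLM-based planner and Minecraft navigation replaced by placeholders.

```python
# Minimal sketch (not the HAS codebase) of centralized planning with
# decentralized, asynchronous execution and a shared memory of outcomes.
import asyncio
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Toy stand-in for a multi-modal memory: stores task outcomes."""
    records: list = field(default_factory=list)

    def add(self, entry: dict) -> None:
        self.records.append(entry)


class Manager:
    """Central planner: decomposes a goal into subtasks for worker agents."""

    def plan(self, goal: str, num_workers: int) -> list[str]:
        # In HAS an LLM would produce this decomposition; here it is faked.
        return [f"{goal} (region {i})" for i in range(num_workers)]


class Worker:
    """Decentralized executor: carries out one subtask and reports back."""

    def __init__(self, name: str, memory: SharedMemory):
        self.name = name
        self.memory = memory

    async def execute(self, subtask: str) -> str:
        await asyncio.sleep(0.1)  # placeholder for navigation / perception
        result = f"{self.name} finished: {subtask}"
        self.memory.add({"agent": self.name, "subtask": subtask, "result": result})
        return result


async def run(goal: str, num_workers: int = 3) -> list[str]:
    memory = SharedMemory()
    subtasks = Manager().plan(goal, num_workers)
    workers = [Worker(f"agent-{i}", memory) for i in range(num_workers)]
    # Subtasks are planned centrally but executed asynchronously in parallel.
    return await asyncio.gather(*(w.execute(t) for w, t in zip(workers, subtasks)))


if __name__ == "__main__":
    for line in asyncio.run(run("explore the village")):
        print(line)
```

The point of the sketch is only the control flow: one planner assigns subtasks, many workers act concurrently, and results accumulate in a shared store that a planner could later query, analogous to the memory-augmented planning described above.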