RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

6 Feb 2024 | Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You
RAP (Retrieval-Augmented Planning) is a framework that enhances the planning capabilities of Large Language Model (LLM) agents by leveraging past experiences in both text-only and multimodal environments. The framework stores past experiences in memory, retrieves them based on their similarity to the current situation, and uses in-context learning to generate subsequent actions. This allows LLM agents to make more informed decisions by drawing on previously successful experiences and to adapt to complex, real-world scenarios through contextual memory.

RAP comprises four core components: Memory, Reasoner, Retriever, and Executor. The Memory stores past experiences, the Reasoner generates plans and retrieval keys, the Retriever finds relevant past experiences, and the Executor uses these experiences to generate actions.

Empirical evaluations show that RAP achieves state-of-the-art performance in textual scenarios and significantly improves the performance of multimodal LLM agents on embodied tasks. RAP outperforms existing methods such as ReAct, Reflexion, and ADaPT on benchmarks including ALFWorld, Webshop, Franka Kitchen, and MetaWorld, and it demonstrates its effectiveness across various LLMs, including GPT-3.5, GPT-4, and Llama2-13b. RAP's ability to retrieve and utilize past experiences in both text and multimodal environments makes it a powerful tool for enhancing the decision-making capabilities of LLM agents, and its success across diverse tasks highlights its potential to advance the functionality and applicability of LLM agents in complex, real-world applications.
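The retrieve-then-act loop described above can be illustrated with a short sketch. The following Python is not the authors' implementation: the Experience, Memory, Retriever, reasoner, and executor names are hypothetical, the word-overlap similarity is a toy stand-in for the embedding-based similarity a real system would use, and the LLM call is stubbed out so the example runs on its own.

```python
# Minimal sketch of RAP's retrieve-then-act loop (hypothetical names, not the
# authors' code). Similarity is toy word overlap; the LLM call is stubbed.

from dataclasses import dataclass


@dataclass
class Experience:
    """One stored trajectory: the task, a retrieval key describing the
    situation, and the action sequence that succeeded."""
    task: str
    key: str
    actions: list[str]


class Memory:
    """Stores past experiences."""

    def __init__(self) -> None:
        self.experiences: list[Experience] = []

    def add(self, exp: Experience) -> None:
        self.experiences.append(exp)


class Retriever:
    """Finds past experiences whose keys best match the current one."""

    def __init__(self, memory: Memory) -> None:
        self.memory = memory

    def retrieve(self, query_key: str, top_k: int = 2) -> list[Experience]:
        # Toy similarity: number of words shared between the two keys.
        q = set(query_key.lower().split())
        scored = sorted(
            self.memory.experiences,
            key=lambda e: len(q & set(e.key.lower().split())),
            reverse=True,
        )
        return scored[:top_k]


def reasoner(task: str, observation: str) -> str:
    """Reasoner: produces a plan/retrieval key for the current situation.
    In RAP this is an LLM prompt; here it is a trivial stand-in."""
    return f"{task} {observation}"


def executor(task: str, observation: str, examples: list[Experience]) -> str:
    """Executor: builds an in-context prompt from the retrieved experiences
    and asks the (stubbed) LLM for the next action."""
    prompt = "Past successful trajectories:\n"
    for e in examples:
        prompt += f"- Task: {e.task}; Actions: {', '.join(e.actions)}\n"
    prompt += f"Current task: {task}\nObservation: {observation}\nNext action:"
    return call_llm(prompt)


def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (e.g. GPT-4 or Llama2-13b).
    return "go to countertop 1"


if __name__ == "__main__":
    memory = Memory()
    memory.add(Experience(
        task="put a clean mug on the desk",
        key="clean mug desk kitchen",
        actions=["go to sinkbasin 1", "clean mug 1",
                 "go to desk 1", "put mug 1 on desk 1"],
    ))

    task = "put a clean plate on the countertop"
    observation = "you are in the kitchen; a plate is on the table"
    key = reasoner(task, observation)
    examples = Retriever(memory).retrieve(key)
    print(executor(task, observation, examples))
```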