RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

6 Feb 2024 | Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You
RAP (Retrieval-Augmented Planning) is a framework that enhances the planning capabilities of Large Language Model (LLM) agents by leveraging past experiences in both text-only and multimodal environments. The framework stores past experiences in memory, retrieves them based on their similarity to the current situation, and uses in-context learning to generate subsequent actions. This allows LLM agents to make more informed decisions by drawing on previously successful experiences and to adapt to complex, real-world scenarios through contextual memory.

RAP comprises four core components: Memory, Reasoner, Retriever, and Executor. The Memory stores past experiences, the Reasoner generates plans and retrieval keys, the Retriever finds relevant past experiences, and the Executor uses these experiences to generate actions.

Empirical evaluations show that RAP achieves state-of-the-art performance in textual scenarios and significantly improves the performance of multimodal LLM agents on embodied tasks. RAP outperforms existing methods such as ReAct, Reflexion, and ADaPT on benchmarks including ALFWorld, Webshop, Franka Kitchen, and MetaWorld, and it demonstrates its effectiveness across various LLMs, including GPT-3.5, GPT-4, and Llama2-13b. RAP's ability to retrieve and utilize past experiences in both text and multimodal environments makes it a powerful tool for enhancing the decision-making capabilities of LLM agents, and its success across diverse tasks highlights its potential to advance the functionality and applicability of LLM agents in complex, real-world applications.
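The retrieve-then-act loop described above can be illustrated with a short sketch. The following Python is not the authors' implementation: the Experience, Memory, Retriever, reasoner, and executor names are hypothetical, the word-overlap similarity is a toy stand-in for the embedding-based similarity a real system would use, and the LLM call is stubbed out so the example runs on its own.

```python
# Minimal sketch of RAP's retrieve-then-act loop (hypothetical names, not the
# authors' code). Similarity is toy word overlap; the LLM call is stubbed.

from dataclasses import dataclass


@dataclass
class Experience:
    """One stored trajectory: the task, a retrieval key describing the
    situation, and the action sequence that succeeded."""
    task: str
    key: str
    actions: list[str]


class Memory:
    """Stores past experiences."""

    def __init__(self) -> None:
        self.experiences: list[Experience] = []

    def add(self, exp: Experience) -> None:
        self.experiences.append(exp)


class Retriever:
    """Finds past experiences whose keys best match the current one."""

    def __init__(self, memory: Memory) -> None:
        self.memory = memory

    def retrieve(self, query_key: str, top_k: int = 2) -> list[Experience]:
        # Toy similarity: number of words shared between the two keys.
        q = set(query_key.lower().split())
        scored = sorted(
            self.memory.experiences,
            key=lambda e: len(q & set(e.key.lower().split())),
            reverse=True,
        )
        return scored[:top_k]


def reasoner(task: str, observation: str) -> str:
    """Reasoner: produces a plan/retrieval key for the current situation.
    In RAP this is an LLM prompt; here it is a trivial stand-in."""
    return f"{task} {observation}"


def executor(task: str, observation: str, examples: list[Experience]) -> str:
    """Executor: builds an in-context prompt from the retrieved experiences
    and asks the (stubbed) LLM for the next action."""
    prompt = "Past successful trajectories:\n"
    for e in examples:
        prompt += f"- Task: {e.task}; Actions: {', '.join(e.actions)}\n"
    prompt += f"Current task: {task}\nObservation: {observation}\nNext action:"
    return call_llm(prompt)


def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (e.g. GPT-4 or Llama2-13b).
    return "go to countertop 1"


if __name__ == "__main__":
    memory = Memory()
    memory.add(Experience(
        task="put a clean mug on the desk",
        key="clean mug desk kitchen",
        actions=["go to sinkbasin 1", "clean mug 1",
                 "go to desk 1", "put mug 1 on desk 1"],
    ))

    task = "put a clean plate on the countertop"
    observation = "you are in the kitchen; a plate is on the table"
    key = reasoner(task, observation)
    examples = Retriever(memory).retrieve(key)
    print(executor(task, observation, examples))
```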