23 Feb 2024 | Yang Deng*, Xuan Zhang*, Wenxuan Zhang², Yifei Yuan³, See-Kiong Ng¹, Tat-Seng Chua¹
This paper introduces Conversational Web Navigation, a task that requires agents to engage in multi-turn interactions with both users and the web environment. The task is supported by a newly developed dataset, Multi-Turn Mind2Web (MT-Mind2Web), constructed from the single-turn interactions of the Mind2Web dataset.

To address the challenges of multi-turn instruction following in web navigation, the paper proposes Self-Reflective Memory-Augmented Planning (Self-MAP), a framework that combines memory utilization and self-reflection to make the most of the limited memory space of LLM-powered agents. The framework first builds a memory bank from the conversational interaction history, where each memory snippet stores one interaction step of a conversation turn. It then retrieves snippets that are semantically relevant and have similar trajectories, filters irrelevant information out of the environment state, and refines the retrieved snippets by generating reasoning rationales. Finally, it plans the next action using this self-reflective memory.

Extensive experiments benchmark the MT-Mind2Web dataset and validate the proposed method: Self-MAP consistently outperforms existing baselines on task success rate and other evaluation metrics. An ablation study validates the specific designs of the framework, showing that memory simplification and refinement are critical for performance. The study highlights the importance of efficient memory management and the effectiveness of the proposed memory-augmented planning framework, concluding that the approach is effective for multi-turn instruction following in web navigation tasks.
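To make the Self-MAP pipeline concrete, here is a minimal sketch of the memory bank, retrieval, simplification, refinement, and planning steps described above. All names (`MemorySnippet`, `SelfMAPAgent`, the toy bag-of-words similarity, and the prompt wording) are illustrative assumptions for exposition, not the paper's actual implementation; a real system would use learned embeddings and the authors' prompts.

```python
# Hedged sketch of a Self-MAP-style agent (assumed structure, not the paper's code).
from __future__ import annotations
from dataclasses import dataclass, field
from collections import Counter
from math import sqrt
from typing import Callable, List


def _cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity; stands in for a real embedding model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[t] * cb[t] for t in ca)
    den = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return num / den if den else 0.0


@dataclass
class MemorySnippet:
    turn: int             # conversation turn this step belongs to
    instruction: str      # user instruction at that turn
    observation: str      # environment state (e.g. pruned DOM) at that step
    action: str           # action taken at that step
    rationale: str = ""   # self-reflective rationale added during refinement


@dataclass
class SelfMAPAgent:
    llm: Callable[[str], str]                        # any text-in / text-out model
    memory: List[MemorySnippet] = field(default_factory=list)

    def remember(self, snippet: MemorySnippet) -> None:
        """Memory bank: store one snippet per interaction step per turn."""
        self.memory.append(snippet)

    def retrieve(self, instruction: str, trajectory: str, k: int = 3) -> List[MemorySnippet]:
        """Rank snippets by semantic relevance plus trajectory similarity, keep top-k."""
        scored = sorted(
            self.memory,
            key=lambda s: _cosine(instruction, s.instruction) + _cosine(trajectory, s.action),
            reverse=True,
        )
        return scored[:k]

    def simplify(self, snippet: MemorySnippet, instruction: str) -> MemorySnippet:
        """Memory simplification: drop environment-state lines irrelevant to the instruction."""
        kept = [ln for ln in snippet.observation.splitlines()
                if _cosine(instruction, ln) > 0.0]
        snippet.observation = "\n".join(kept) or snippet.observation
        return snippet

    def refine(self, snippet: MemorySnippet) -> MemorySnippet:
        """Memory refinement: ask the model for a short rationale behind the stored action."""
        snippet.rationale = self.llm(
            f"Instruction: {snippet.instruction}\nAction: {snippet.action}\n"
            "In one sentence, why was this action appropriate?"
        )
        return snippet

    def plan(self, instruction: str, observation: str, trajectory: str) -> str:
        """Plan the next action conditioned on the self-reflective memory."""
        snippets = [self.refine(self.simplify(s, instruction))
                    for s in self.retrieve(instruction, trajectory)]
        memory_text = "\n".join(
            f"[turn {s.turn}] {s.instruction} -> {s.action} ({s.rationale})"
            for s in snippets
        )
        prompt = (f"Relevant memory:\n{memory_text}\n\n"
                  f"Current instruction: {instruction}\n"
                  f"Current page:\n{observation}\n"
                  f"Actions so far: {trajectory}\n"
                  "Next action:")
        return self.llm(prompt)
```

A stub model such as `lambda prompt: "CLICK [search button]"` is enough to exercise the control flow end to end; in practice the prompts would be sent to the underlying LLM and the observation would be the simplified HTML of the current page.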