On the Multi-turn Instruction Following for Conversational Web Agents

23 Feb 2024 | Yang Deng1*, Xuan Zhang1*, Wenxuan Zhang2, Yifei Yuan3, See-Kiong Ng1, Tat-Seng Chua1
This paper introduces a new task called Conversational Web Navigation, which involves sophisticated interactions with both users and the web environment over multiple turns. To address this challenge, the authors propose a novel framework named Self-Reflective Memory-Augmented Planning (Self-MAP), which combines memory utilization and self-reflection techniques. The dataset used for this task, Multi-Turn Mind2Web (MT-Mind2Web), is constructed by using single-turn interactions from the Mind2Web dataset as guidance to create conversation sessions.

The Self-MAP framework consists of three main components: the Memory, Reflection, and Planning Modules. The Memory Module constructs a memory bank from the conversational interaction history, while the Reflection Module simplifies and refines the retrieved memory snippets, filtering out irrelevant information and enriching the memory with reasoning rationales. The Planning Module then uses this self-reflective memory to plan the next action. Extensive experiments are conducted to benchmark the MT-Mind2Web dataset and validate the effectiveness of the proposed method. The results show that Self-MAP consistently outperforms existing baselines, demonstrating the framework's effectiveness in handling multi-turn instruction-following tasks for web agents.
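To make the retrieve-reflect-plan loop concrete, here is a minimal sketch of such an agent. This is not the authors' implementation: the class and method names are hypothetical, the word-overlap relevance score stands in for the paper's learned retrieval, and the planner is a stub where a real agent would call an LLM with the refined memory as context.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySnippet:
    """One past (instruction, action) pair, optionally annotated with a rationale."""
    instruction: str
    action: str
    rationale: str = ""


@dataclass
class SelfMAPAgent:
    """Illustrative Self-MAP-style loop (hypothetical names, not the paper's code)."""
    memory_bank: list = field(default_factory=list)

    def remember(self, instruction: str, action: str) -> None:
        """Memory Module: append each interaction turn to the memory bank."""
        self.memory_bank.append(MemorySnippet(instruction, action))

    def retrieve(self, query: str, top_k: int = 2) -> list:
        """Rank snippets by a toy relevance score: word overlap with the query."""
        def score(s: MemorySnippet) -> int:
            return len(set(query.lower().split()) & set(s.instruction.lower().split()))
        return sorted(self.memory_bank, key=score, reverse=True)[:top_k]

    def reflect(self, query: str, snippets: list) -> list:
        """Reflection Module: drop irrelevant snippets, attach reasoning rationales."""
        kept = []
        for s in snippets:
            overlap = set(query.lower().split()) & set(s.instruction.lower().split())
            if overlap:
                s.rationale = f"shares terms {sorted(overlap)} with current instruction"
                kept.append(s)
        return kept

    def plan(self, query: str) -> str:
        """Planning Module: condition the next action on self-reflective memory.

        A real agent would prompt an LLM here; this stub just returns the
        assembled planning context.
        """
        memory = self.reflect(query, self.retrieve(query))
        context = "; ".join(f"{s.action} ({s.rationale})" for s in memory)
        return f"next_action(query={query!r}, context=[{context}])"


# Usage: two past turns, then a follow-up instruction in the same session.
agent = SelfMAPAgent()
agent.remember("search flights to Tokyo", "CLICK search_box")
agent.remember("check the weather", "CLICK weather_tab")
plan = agent.plan("book the Tokyo flight")
print(plan)
```

The key design point the sketch mirrors is that planning never sees the raw history: the memory bank is first filtered and annotated by the reflection step, so only relevant, rationale-enriched snippets reach the planner.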