Understanding WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

WILBUR is an advanced web agent designed to achieve both generalization and accuracy in web navigation and task execution. It addresses the challenges of high variance in website structures and the inability of existing fine-tuning and in-context learning techniques to generalize across multiple websites. WILBUR introduces a differentiable ranking model and a novel instruction synthesis technique to optimize the prompt for a large language model (LLM) with task demonstrations from previous runs. It also includes an intelligent backtracking mechanism to recover from mistakes, maximizing end-to-end success rates.

WILBUR's key contributions include:

1. **Backtracking Mechanism**: The ability to explore, reflect, and backtrack, allowing the agent to learn from and recover from failed actions.
2. **Instruction Synthesis**: Summarizing a large number of successful and unsuccessful actions into concise instructions, enhancing the agent's understanding and adaptability.
3. **Autocurriculum**: Using an LLM-based auto-curriculum to generate plausible goals and collect representative demonstrations, enabling the agent to learn from a diverse set of tasks without manual annotation.

On the WebVoyager benchmark, WILBUR achieves state-of-the-art results, outperforming text-only models by 8% overall and up to 36% on certain websites. It is also within 5% of a strong multi-modal model despite receiving only textual inputs. The analysis of failures reveals that many issues are engineering challenges rather than model limitations, highlighting the need for further improvements in web agent design and execution. The paper provides a detailed evaluation of WILBUR, including ablation studies and error analysis, demonstrating its effectiveness and potential for future research in web agent development.
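
To make the backtracking idea concrete, here is a minimal Python sketch of an explore-and-backtrack loop. It is an illustration only, not WILBUR's implementation: the toy `SITE` graph, the `Trace` class, and `run_with_backtracking` are all hypothetical names, and a plain depth-first search stands in for the LLM that would choose actions in the real agent. It only shows how failed steps can be undone and recorded for later use.

```python
from dataclasses import dataclass, field

# Hypothetical toy "website": each page maps an action name to the page it opens.
# A real agent would observe the DOM and ask an LLM which action to take.
SITE = {
    "home": {"click_about": "about", "click_search": "search"},
    "about": {},
    "search": {"click_help": "help", "type_query": "results"},
    "help": {},
    "results": {},
}


@dataclass
class Trace:
    actions: list = field(default_factory=list)   # actions on the current path
    failures: list = field(default_factory=list)  # (page, action) pairs that hit dead ends


def run_with_backtracking(page, goal, trace, max_depth=5):
    """Depth-first exploration: try an action and undo it if it leads nowhere."""
    if page == goal:
        return True
    if max_depth == 0:
        return False
    for action, next_page in SITE[page].items():
        trace.actions.append(action)
        if run_with_backtracking(next_page, goal, trace, max_depth - 1):
            return True
        # Backtrack: drop the failed step and remember it as a negative example.
        trace.actions.pop()
        trace.failures.append((page, action))
    return False


trace = Trace()
print(run_with_backtracking("home", "results", trace))  # True
print(trace.actions)   # ['click_search', 'type_query']
print(trace.failures)  # [('home', 'click_about'), ('search', 'click_help')]
```

In this sketch the recorded `failures` play the role that unsuccessful actions play in the summary above: they are the kind of material that could later be condensed into instructions or shown to the model as negative demonstrations.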

WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

8 Apr 2024 | Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna