WEBLINX: Real-World Website Navigation with Multi-Turn Dialogue

8 Feb 2024 | Xing Han Lu, Zdeněk Kasner, Siva Reddy
The paper introduces the problem of *conversational web navigation*, in which a digital agent controls a web browser and follows user instructions to solve real-world tasks through multi-turn dialogue. To study this setting, the authors propose WEBLINX, a large-scale benchmark of 100K interactions across 2,300 expert demonstrations on 155 real-world websites. The benchmark covers a broad range of interaction patterns and can be used to train and evaluate agents in diverse scenarios.

Because entire web pages are too large to process in real time, the authors design a retrieval-inspired model called Dense Markup Ranking (DMR) that efficiently prunes HTML pages by ranking their elements by relevance. Combining the pruned HTML with screenshots and action history, they assess a range of models on their ability to replicate human behavior during web navigation. The experiments span small text-only models to large multimodal LLMs, and the findings show that smaller finetuned decoders outperform zero-shot LLMs, including GPT-4V. However, all models struggle to generalize to unseen websites, highlighting the need for large multimodal models that can handle novel settings. The code, data, and models are available for research.
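The DMR component can be understood as dense retrieval over DOM elements. The sketch below illustrates that idea with an off-the-shelf sentence-transformers dual encoder: it embeds the dialogue context and each candidate HTML element, then keeps only the top-ranked elements. The model name, the helper `rank_elements`, and the example strings are illustrative assumptions, not the authors' exact architecture or training setup.

```python
# Minimal sketch of dense markup ranking: embed the dialogue/action context
# and each candidate HTML element, then keep the top-k elements by cosine
# similarity. The model and truncation choices here are assumptions for
# illustration, not the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def rank_elements(context: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Return the top_k candidate element strings most similar to the context."""
    ctx_emb = encoder.encode(context, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(ctx_emb, cand_embs)[0]  # shape: (len(candidates),)
    ranked = scores.argsort(descending=True)[:top_k]
    return [candidates[int(i)] for i in ranked]

# Example: prune a page to the elements most relevant to the user's request.
elements = [
    '<button id="submit">Search flights</button>',
    '<input name="destination" placeholder="Where to?">',
    '<footer>© 2024 Example Travel</footer>',
]
print(rank_elements("User: book a flight to Montreal", elements, top_k=2))
```

Pruning the page this way keeps the input short enough for a downstream model to consume alongside screenshots and action history, which is the role DMR plays in the paper's pipeline.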