8 Feb 2024 | Xing Han Lù, Zdeněk Kasner, Siva Reddy
WEBLINX is a large-scale benchmark for conversational web navigation, consisting of 2337 expert demonstrations across 155 real-world websites. The benchmark comprises over 100,000 interactions and covers a wide range of web navigation patterns. It is designed to train and evaluate agents in diverse scenarios, such as helping visually impaired users navigate websites through a chat interface, enhancing smart speakers with voice-controlled web navigation, and reducing repetitive steps to improve productivity. Model performance is assessed with turn-level metrics such as intent match, element similarity, and text similarity.
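To make these metrics concrete, here is a minimal sketch of how such turn-level scores could be computed. The `Action` dataclass, its field names, and the use of exact matching and a character-level F1 are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative turn-level metrics in the spirit of WEBLINX's intent match,
# element similarity, and text similarity. Field names and scoring choices
# here are assumptions, not the benchmark's official code.
from dataclasses import dataclass

@dataclass
class Action:
    intent: str           # e.g. "click", "textinput", "say"
    element_id: str = ""  # target DOM element, if any
    text: str = ""        # typed or spoken text, if any

def intent_match(pred: Action, ref: Action) -> float:
    """1.0 if the predicted action type equals the reference type, else 0.0."""
    return float(pred.intent == ref.intent)

def element_similarity(pred: Action, ref: Action) -> float:
    """Crude element score: exact match on the target element identifier."""
    if not ref.element_id:
        return float(not pred.element_id)
    return float(pred.element_id == ref.element_id)

def text_similarity(pred: Action, ref: Action) -> float:
    """Bag-of-characters F1 as a stand-in for a text overlap metric."""
    if not pred.text or not ref.text:
        return float(pred.text == ref.text)
    common = sum(min(pred.text.count(c), ref.text.count(c)) for c in set(ref.text))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred.text), common / len(ref.text)
    return 2 * precision * recall / (precision + recall)
```

In practice, scores like these would be averaged over the turns of a demonstration; the paper's actual metrics are more nuanced, but the idea is the same: each predicted action is scored against the reference action for that turn.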
To address the challenge of processing large HTML pages, the authors propose Dense Markup Ranking (DMR), which prunes a page by ranking DOM elements according to their relevance to the current turn. The resulting compact DOM representation makes it feasible to evaluate a wide range of models, from smaller text-only decoders to larger multimodal models, on their ability to navigate websites.
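As a rough illustration of similarity-based DOM pruning, the sketch below embeds a textual query (for example, the latest instruction plus action history) and short textual renderings of candidate elements, then keeps only the top-k most similar elements. The embedding model, the snippet format, and the function names are assumptions for illustration, not the paper's exact DMR setup.

```python
# A hedged sketch of similarity-based DOM pruning in the spirit of Dense
# Markup Ranking: rank candidate element snippets against a textual query
# and keep the most relevant ones. Model choice and snippet format are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

def prune_dom(query: str, elements: list[str], top_k: int = 10) -> list[str]:
    """Return the top_k element snippets most similar to the query."""
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    query_emb = model.encode(query, convert_to_tensor=True)
    elem_embs = model.encode(elements, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, elem_embs)[0]        # one score per element
    ranked = scores.argsort(descending=True)[:top_k]
    return [elements[int(i)] for i in ranked]

# Hypothetical usage with made-up element snippets:
candidates = [
    '<button id="search-btn">Search</button>',
    '<input name="destination" placeholder="Enter a city">',
    '<a href="/about">About us</a>',
]
print(prune_dom("Instructor: find hotels in Montreal", candidates, top_k=2))
```

The authors train a dedicated ranker for this step, whereas the sketch above uses an off-the-shelf embedding model and only conveys the general idea.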
The experiments show that smaller finetuned text-only decoders outperform larger multimodal models, and that even the best zero-shot model, GPT-4V, is surpassed by finetuned models. However, all finetuned models struggle to generalize to new settings, such as unseen websites from a different geographic location or conversations where the instructor gives instructions without seeing the screen. These findings highlight the need for large multimodal models that can generalize to novel settings.
The benchmark is designed to assess the ability of LLM agents not only to follow self-contained instructions but also to engage with their environment through dialogue and to generalize to unforeseen situations. The authors believe significant effort will still be needed to make progress on conversational web navigation, and they release the code, data, and models to support further research.