Dual-View Visual Contextualization for Web Navigation

2024-03-30 | Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
The paper "Dual-View Visual Contextualization for Web Navigation" addresses the challenge of automatic web navigation, where the goal is to build a web agent that can follow natural language instructions to perform complex tasks on real-world websites. The authors propose a method called Dual-View Contextualized Representation (DUAL-VCR) to enhance the context of HTML elements by leveraging their "dual views" in webpage screenshots. Each HTML element has a corresponding bounding box and visual content in the screenshot, and DUAL-VCR contextualizes each element with its neighbors using both textual and visual features. This approach aligns with the insight that semantically related and task-related elements are often located nearby on webpages to enhance user experiences. The method is evaluated on the Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. DUAL-VCR consistently outperforms baselines in all scenarios, including cross-task, cross-website, and cross-domain tasks, achieving a 3.7% absolute gain on average over nine evaluation metrics. The paper also includes comprehensive analyses to understand the impact of different components of DUAL-VCR on web navigation performance, demonstrating the effectiveness of visual neighbor information in enhancing the agent's decision-making process.The paper "Dual-View Visual Contextualization for Web Navigation" addresses the challenge of automatic web navigation, where the goal is to build a web agent that can follow natural language instructions to perform complex tasks on real-world websites. The authors propose a method called Dual-View Contextualized Representation (DUAL-VCR) to enhance the context of HTML elements by leveraging their "dual views" in webpage screenshots. Each HTML element has a corresponding bounding box and visual content in the screenshot, and DUAL-VCR contextualizes each element with its neighbors using both textual and visual features. This approach aligns with the insight that semantically related and task-related elements are often located nearby on webpages to enhance user experiences. The method is evaluated on the Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. DUAL-VCR consistently outperforms baselines in all scenarios, including cross-task, cross-website, and cross-domain tasks, achieving a 3.7% absolute gain on average over nine evaluation metrics. The paper also includes comprehensive analyses to understand the impact of different components of DUAL-VCR on web navigation performance, demonstrating the effectiveness of visual neighbor information in enhancing the agent's decision-making process.