Dual-View Visual Contextualization for Web Navigation

30 Mar 2024 | Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
This paper proposes DUAL-VCR, a dual-view contextualized representation method for web navigation. The method enriches the context of HTML elements by leveraging their "dual views" in webpage screenshots: each HTML element has a corresponding bounding box and visual content in the screenshot. The approach builds on the observation that web developers tend to place task-related elements near one another to improve the user experience. DUAL-VCR therefore contextualizes each element with its spatially neighboring elements, using both textual and visual features, yielding more informative representations for the agent to act on.

The method is validated on the Mind2Web dataset, which covers diverse navigation domains and tasks on real-world websites. DUAL-VCR consistently outperforms the baseline across all generalization settings, including cross-task, cross-website, and cross-domain evaluation. It improves the MindAct pipeline by integrating the dual-view representation into both the element-ranking and action-prediction steps, and offers clear advantages in computation and accuracy over baselines that consume entire HTML documents or full screenshots as input.

The contributions of this work include proposing DUAL-VCR, demonstrating its effectiveness on the Mind2Web benchmark, and conducting comprehensive analyses of how design choices affect web navigation performance.
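To make the core idea concrete, below is a minimal sketch of how an element could be contextualized with its spatial neighbors using screenshot bounding boxes. The `Element` class, the k-nearest-neighbor selection by bounding-box centers, and the `[SEP]` concatenation are illustrative assumptions for this sketch, not the paper's exact formulation, which additionally fuses visual features from the corresponding screenshot regions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Element:
    """An HTML element paired with its bounding box in the screenshot (the 'dual view')."""
    html: str                                 # serialized HTML snippet
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels

def bbox_center(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def nearest_neighbors(target: Element, candidates: List[Element], k: int = 3) -> List[Element]:
    """Pick the k elements whose screenshot bounding boxes are closest to the target's."""
    tx, ty = bbox_center(target.bbox)

    def dist(e: Element) -> float:
        ex, ey = bbox_center(e.bbox)
        return ((ex - tx) ** 2 + (ey - ty) ** 2) ** 0.5

    others = [e for e in candidates if e is not target]
    return sorted(others, key=dist)[:k]

def contextualized_text(target: Element, candidates: List[Element], k: int = 3) -> str:
    """Concatenate the target element with its spatial neighbors to form a richer
    textual input for a downstream element ranker or action-prediction model."""
    neighbors = nearest_neighbors(target, candidates, k)
    return " [SEP] ".join([target.html] + [n.html for n in neighbors])

if __name__ == "__main__":
    # Hypothetical flight-search page: task-related fields are laid out near each other.
    elements = [
        Element('<input id="from" placeholder="From">', (100, 200, 300, 230)),
        Element('<input id="to" placeholder="To">', (100, 240, 300, 270)),
        Element('<button id="search">Search flights</button>', (100, 280, 200, 310)),
        Element('<a href="/careers">Careers</a>', (900, 900, 960, 920)),
    ]
    # The "Search flights" button gets contextualized with the nearby From/To inputs,
    # while the distant "Careers" link is ignored.
    print(contextualized_text(elements[2], elements, k=2))
```

The point of the sketch is the neighbor-selection step: because related controls tend to be visually clustered, proximity in the screenshot is a cheap signal for which other elements are worth attending to, in contrast to feeding the entire HTML document or full screenshot to the model.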