VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

2024-04-09 | Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
VisualWebBench is a comprehensive benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in understanding and grounding web pages. The benchmark consists of seven tasks covering website-, element-, and action-level understanding, reasoning, and grounding, and includes 1.5K human-curated instances drawn from 139 real websites spanning 87 sub-domains. An evaluation of 14 open-source MLLMs alongside proprietary models, including Gemini Pro, the Claude-3 series, and GPT-4V(ision), reveals significant challenges and performance gaps. Key findings include:

1. **Significant Challenges**: Even powerful models like GPT-4V achieve only moderate average scores, indicating substantial room for improvement.
2. **Performance Gaps**: There is a notable disparity between open-source and proprietary MLLMs, with GPT-4V and Claude outperforming the open-source models.
3. **Scaling Effects**: Larger models generally perform better, with the 34B version of LLaVA achieving the highest average score among the open-source models.
4. **General vs. Web-Specific Models**: GUI-agent MLLMs do not significantly outperform general-purpose models, highlighting the need for specialized training techniques.
5. **Correlations with General and Agent Benchmarks**: Performance on VisualWebBench does not correlate strongly with general or agent-specific benchmarks, emphasizing the need for web-specific evaluation.
6. **Image Resolution**: Most MLLMs struggle with low-resolution screenshots, and higher-resolution inputs generally improve performance.
7. **Grounding Capabilities**: Current MLLMs have limited grounding abilities, particularly in text-rich environments.

VisualWebBench aims to serve as a valuable resource for researchers developing more capable and versatile MLLMs for web-related applications.
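To give a feel for how an evaluation over the benchmark's tasks might be wired up, the sketch below scores a model's predictions on one task with simple exact-match accuracy. This is not the authors' official harness: the dataset identifier, task name, and field names (`image`, `question`, `answer`) are assumptions for illustration only, and should be checked against the official release.

```python
# Minimal evaluation sketch (not the official VisualWebBench harness).
# Assumes the benchmark is available as a Hugging Face dataset; the dataset ID,
# task/config name, and column names below are illustrative assumptions.
from datasets import load_dataset


def evaluate_task(answer_fn,
                  dataset_id="visualwebbench/VisualWebBench",  # assumed ID
                  task="heading_ocr",                          # assumed task name
                  split="test"):
    """Score `answer_fn(image, question) -> str` with exact-match accuracy."""
    data = load_dataset(dataset_id, task, split=split)
    correct = 0
    for example in data:
        prediction = answer_fn(example["image"], example["question"])
        if prediction.strip().lower() == str(example["answer"]).strip().lower():
            correct += 1
    return correct / len(data)


if __name__ == "__main__":
    # Trivial baseline that always answers the same string, just to show usage.
    score = evaluate_task(lambda image, question: "unknown")
    print(f"exact-match accuracy: {score:.3f}")
```

Exact match is only a reasonable proxy for the OCR- and QA-style tasks; grounding tasks (element and action grounding) would instead need multiple-choice or coordinate-based scoring.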