2024-04-09 | Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
VisualWebBench is a comprehensive benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in understanding and grounding web pages. The benchmark consists of seven tasks covering website-, element-, and action-level understanding, reasoning, and grounding. It includes 1.5K human-curated instances from 139 real websites, spanning 87 sub-domains. The evaluation of 14 open-source MLLMs together with proprietary models such as Gemini Pro, the Claude-3 series, and GPT-4V(ision) reveals significant challenges and performance gaps. Key findings include:
1. **Significant Challenges**: Even powerful models like GPT-4V achieve only moderate average scores, leaving substantial room for improvement.
2. **Performance Gaps**: Proprietary MLLMs such as GPT-4V and the Claude-3 series outperform their open-source counterparts by a notable margin.
3. **Scaling Effects**: Larger models generally perform better, with the 34B variant of LLaVA achieving the highest average score among the open-source models evaluated.
4. **General vs. Web-Specific Models**: GUI-agent MLLMs do not significantly outperform general-purpose models, highlighting the need for more effective web-specific training techniques.
5. **Correlations with General and Agent Benchmarks**: Performance on VisualWebBench does not correlate strongly with general or agent-specific benchmarks, emphasizing the need for web-specific evaluation.
6. **Image Resolution**: Most MLLMs struggle with low-resolution images, and higher resolution inputs generally improve performance.
7. **Grounding Capabilities**: Current MLLMs have limited grounding abilities, particularly in text-rich environments.
VisualWebBench aims to serve as a valuable resource for researchers to develop more capable and versatile MLLMs for web-related applications.
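For readers who want to probe these findings themselves, a minimal evaluation sketch is shown below. It assumes the benchmark can be loaded via the Hugging Face `datasets` library; the dataset identifier, the field names (`task_type`, `image`, `question`, `answer`), and the `query_model` stub are illustrative assumptions, not the benchmark's documented interface, and the uniform exact-match scoring is a simplification of the task-specific metrics reported in the paper.

```python
from collections import defaultdict
from datasets import load_dataset  # pip install datasets


def query_model(image, question):
    """Placeholder for an MLLM call (API or local model).

    Replace with a real multimodal inference call; this stub is
    purely illustrative and always returns an empty string.
    """
    return ""


def evaluate(dataset_id="visualwebbench/VisualWebBench"):  # hypothetical dataset id
    # Assumed schema: each example carries a screenshot, a question or
    # instruction, a gold answer, and a label for one of the seven tasks.
    data = load_dataset(dataset_id, split="test")

    correct, total = defaultdict(int), defaultdict(int)
    for example in data:
        task = example["task_type"]  # assumed field name
        pred = query_model(example["image"], example["question"])
        # Exact-match accuracy per task; the paper uses task-specific
        # metrics, so treat this as a rough approximation.
        gold = str(example["answer"]).strip().lower()
        correct[task] += int(pred.strip().lower() == gold)
        total[task] += 1

    for task in sorted(total):
        print(f"{task}: {correct[task] / total[task]:.3f} ({total[task]} examples)")


if __name__ == "__main__":
    evaluate()
```

Swapping in a real `query_model` implementation (e.g., a call to an open-source MLLM or a hosted API) would let one reproduce per-task comparisons in the spirit of the paper's leaderboard, though exact numbers depend on using the benchmark's own prompts and metrics.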