9 Apr 2024 | Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
VisualWebBench is a new multimodal benchmark designed to evaluate the capabilities of multimodal large language models (MLLMs) in web page understanding and grounding. The benchmark comprises seven tasks spanning key aspects of web interaction, including website captioning, webpage QA, OCR, and element and action grounding. It consists of 1,500 human-curated instances drawn from 139 real websites across 87 sub-domains.

The authors evaluate 14 open-source MLLMs alongside proprietary models such as Gemini Pro, the Claude-3 series, and GPT-4V, revealing significant challenges and performance gaps. Current MLLMs struggle with tasks that demand fine-grained understanding, such as OCR and grounding, especially in text-rich environments and with low-resolution screenshots. Proprietary models like GPT-4V and the Claude series outperform their open-source counterparts; the best open-source model, LLaVA-1.6-34B, achieves a score of 50.5. The results also show that performance on web tasks is not strongly correlated with performance on general multimodal or agent benchmarks, underscoring the importance of web-specific evaluation.

By covering understanding, OCR, grounding, and reasoning within a single standardized framework, VisualWebBench addresses the limitations of existing benchmarks and provides a consistent way to assess models in web scenarios, helping researchers develop more capable and versatile MLLMs for web-related applications.
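To make the evaluation setup concrete, here is a minimal sketch of how one might load the benchmark's task splits and aggregate a macro-averaged score like the 50.5 cited above. The dataset identifier, task config names, field names, and the `predict` stub are assumptions for illustration only, not the authors' official evaluation harness, and the exact-match scoring is a simplification (the paper uses task-specific metrics such as ROUGE for captioning and accuracy for multiple-choice tasks).

```python
# Hypothetical sketch: load VisualWebBench-style task splits and compute a
# macro-averaged score across the seven tasks. Dataset id, config names,
# field names, and the `predict` stub are assumed, not the official harness.
from datasets import load_dataset

TASKS = [
    "web_caption", "webqa", "heading_ocr", "element_ocr",
    "element_ground", "action_prediction", "action_ground",
]  # the seven tasks described in the paper; config names here are assumed


def predict(example) -> str:
    """Placeholder for a model call; should return the model's answer string."""
    raise NotImplementedError("plug in your MLLM inference here")


def score_task(task_name: str) -> float:
    """Exact-match accuracy on one task split (a simplification of the
    paper's task-specific metrics)."""
    ds = load_dataset("visualwebbench/VisualWebBench", task_name, split="test")
    correct = sum(
        predict(ex).strip().lower() == str(ex["answer"]).strip().lower()
        for ex in ds
    )
    return 100.0 * correct / len(ds)


if __name__ == "__main__":
    per_task = {t: score_task(t) for t in TASKS}
    overall = sum(per_task.values()) / len(per_task)  # simple macro-average
    for task, acc in per_task.items():
        print(f"{task:>18}: {acc:5.1f}")
    print(f"{'overall':>18}: {overall:5.1f}")
```

In practice, each task would be scored with its own metric before averaging, but the macro-average over tasks reflects how a single headline number per model can be reported.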