16 Jun 2024 | Yujie Lu, Dongfu Jiang, Wenhui Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin
The paper introduces WILDVISION-ARENA (WV-ARENA) and WILDVISION-BENCH (WV-BENCH) to evaluate vision-language models (VLMs) in real-world scenarios. WV-ARENA is an online platform that supports multi-round multimodal chats with over 20 models, allowing users to compare and vote on responses. WV-BENCH curates 500 samples from 20,000+ user submissions, using GPT-4 as the judge to compare each VLM with Claude-3-Sonnet. The Spearman correlation between WV-BENCH scores and WV-ARENA Elo ratings is 0.94, outperforming other benchmarks. The analysis of 20,000+ multimodal conversations reveals areas for improvement, such as visual context recognition, spatial reasoning, and expert domain knowledge. The paper also discusses failure cases and provides a comprehensive leaderboard of VLMs, aiming to advance research in VLMs by releasing chat and feedback data.
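The Elo ratings that drive the WV-ARENA leaderboard come from pairwise user votes. As a rough illustration of how arena-style ratings are typically derived (not the paper's actual implementation; the K-factor and starting ratings below are hypothetical), one Elo update from a single vote looks like this:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one head-to-head vote.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    # Expected win probability for A under the standard Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    # Rating changes are zero-sum: A's gain is B's loss.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two models start at 1000; model A wins one user vote.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # A gains 16 points, B loses 16
```

Aggregating thousands of such votes yields the per-model ratings that WV-BENCH's GPT-4-judged scores are then correlated against (Spearman 0.94 in the paper).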