WILDVISION: Evaluating Vision-Language Models in the Wild with Human Preferences


24 Jun 2024 | Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin
WILDVISION-ARENA is an online platform that evaluates vision-language models (VLMs) by collecting human preferences. It supports multi-round multimodal chats with over 20 models, enabling side-by-side comparisons on real-world queries. The platform has collected more than 20,000 human-AI chat interactions, including over 8,000 votes with detailed feedback, and this data is released as a comprehensive resource for future VLM research.

WILDVISION-BENCH is a benchmark of 500 high-quality samples curated from the arena data. It is challenging yet natural, reflecting real-world human use cases, and its model rankings closely match human preferences: benchmark scores achieve a Spearman correlation of 0.94 with the Elo ratings on the WILDVISION-ARENA leaderboard.

Analysis of the collected interactions and failure cases shows where even top VLMs fall short: they struggle with subtle visual context, spatial reasoning, and expert domain knowledge, and they remain prone to hallucinations and unsafe behavior when provoked. The platform maintains a live leaderboard and failure-case analysis to track recent advances, and the work as a whole narrows the gap between benchmark metrics and human preferences, supporting the development of more effective and safer VLMs.
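To make the evaluation mechanism concrete, the sketch below shows one common way pairwise human votes, like those collected in WILDVISION-ARENA, can be turned into Elo ratings, and how a benchmark ranking can be compared to the arena leaderboard via Spearman correlation. The model names, vote list, benchmark scores, and K-factor are illustrative assumptions, not data from the paper, and the arena's actual rating procedure may differ.

```python
# Minimal sketch: Elo ratings from pairwise human votes, plus a Spearman
# correlation between hypothetical benchmark scores and the resulting ratings.
# All names and numbers below are illustrative assumptions.
from scipy.stats import spearmanr

K = 32          # Elo K-factor (assumed; the arena may use a different update rule)
INIT = 1000.0   # initial rating assigned to every model


def expected(r_a: float, r_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Update ratings in place after one human vote (winner beat loser)."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)


# Hypothetical battles: (model_a, model_b, winner)
votes = [
    ("gpt-4v", "llava-1.5", "gpt-4v"),
    ("gpt-4v", "gemini-pro", "gpt-4v"),
    ("gemini-pro", "llava-1.5", "gemini-pro"),
]

ratings = {m: INIT for vote in votes for m in vote[:2]}
for a, b, winner in votes:
    loser = b if winner == a else a
    update_elo(ratings, winner, loser)

# Hypothetical benchmark scores for the same models.
bench = {"gpt-4v": 80.0, "gemini-pro": 65.0, "llava-1.5": 50.0}

models = sorted(ratings)
rho, _ = spearmanr([ratings[m] for m in models], [bench[m] for m in models])
print({m: round(ratings[m], 1) for m in models})
print(f"Spearman correlation between benchmark and arena Elo: {rho:.2f}")
```

With only three consistent votes the toy example yields a perfect rank correlation; the 0.94 figure in the paper comes from comparing the full WILDVISION-BENCH ranking against the arena leaderboard across many more models and votes.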