18 Jul 2024 | Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Stańczak, Aishwarya Agrawal
This paper introduces CULTURALVQA, a novel benchmark designed to assess the cultural understanding of Vision Language Models (VLMs). The benchmark evaluates VLMs' ability to comprehend diverse cultural contexts through visual question answering, covering 11 countries across 5 continents. The dataset contains 2,378 image-question pairs with 1-5 answers per question, spanning cultural facets such as clothing, food, drinks, rituals, and traditions. The evaluation reveals significant disparities in VLM performance across regions, with higher accuracy on North American cultures than on African-Islamic cultures. Moreover, closed-source models such as GPT-4V outperform open-source models, with gaps of up to 29.78% for some countries. The study also finds that VLMs perform better on tangible cultural concepts such as clothing and rituals than on intangible concepts such as food and drink. The results underscore the current limitations of VLMs in achieving uniform cultural comprehension and identify areas for improvement. The paper discusses the challenges and ethical considerations in collecting culturally rich data and calls for future work to expand the dataset and develop multilingual versions for greater inclusivity.
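To make the evaluation setup concrete, the sketch below shows how per-region accuracy could be computed over a dataset of this shape. The record fields (image, question, answers, region) and the simple string-match scoring are assumptions for illustration only; the paper's actual answer-matching protocol is not reproduced here.

```python
from collections import defaultdict

def string_match(prediction: str, references: list[str]) -> bool:
    """Count a prediction as correct if it matches any of the 1-5 reference
    answers after simple normalization (the paper's scoring may differ)."""
    pred = prediction.strip().lower()
    return any(pred == ref.strip().lower() for ref in references)

def per_region_accuracy(examples, predict):
    """Aggregate VLM accuracy by region.

    `examples` is assumed to be an iterable of dicts with hypothetical keys
    'image', 'question', 'answers', and 'region'; `predict` is any callable
    mapping (image, question) -> answer string, e.g. a wrapped VLM.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        answer = predict(ex["image"], ex["question"])
        total[ex["region"]] += 1
        if string_match(answer, ex["answers"]):
            correct[ex["region"]] += 1
    return {region: correct[region] / total[region] for region in total}
```

Comparing the resulting per-region scores for different models is one simple way to surface the regional disparities the paper reports.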