IndoCulture is a novel dataset designed to evaluate the influence of geographical factors on language model reasoning ability, specifically focusing on the diverse cultures of eleven Indonesian provinces. Unlike previous studies that primarily relied on English datasets and templates, IndoCulture was manually developed by local experts in each province based on predefined topics. The dataset covers 12 fine-grained cultural topics and includes 2,429 instances, with rigorous quality control measures in place.
The study assesses the performance of 23 language models, including open-source multilingual models, Indonesian-centric models, and closed-source models. Key findings include:
1. Even the best open-source model struggles, reaching only 53.2% accuracy.
2. Models often provide more accurate predictions for specific provinces, such as Bali and West Java.
3. Including location context improves performance, especially for larger models such as GPT-4.
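The location-context manipulation in the third finding can be sketched as a difference in prompt construction: the same multiple-choice instance is presented either with or without its province prepended. The field names, prompt wording, and example instance below are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of prompting with vs. without location context,
# assuming a simple instance format (premise, answer options, province).
# The prompt wording here is hypothetical, not IndoCulture's template.

def build_prompt(premise, options, province=None):
    """Format a multiple-choice prompt, optionally prefixed with the
    instance's province as location context."""
    context = f"[Location: {province}] " if province else ""
    choices = "\n".join(f"{label}. {text}" for label, text in zip("ABC", options))
    return f"{context}{premise}\n{choices}\nAnswer:"

# Hypothetical instance for illustration only.
instance = {
    "premise": "Sebelum upacara dimulai, para tamu biasanya ...",
    "options": ["pilihan A", "pilihan B", "pilihan C"],
    "province": "Bali",
}

without_location = build_prompt(instance["premise"], instance["options"])
with_location = build_prompt(
    instance["premise"], instance["options"], province=instance["province"]
)
```

Comparing model accuracy between the two prompt variants isolates the contribution of geographical context, which is the comparison underlying the third finding.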
The research highlights the importance of geographical context in commonsense reasoning and suggests that large language models need to be better equipped to handle cultural diversity. The study also discusses the limitations and ethical considerations of the dataset, emphasizing the need for future research to explore temporal aspects and expand geographical coverage.