CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting


2024 | Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen, Bill Yuchen Lin, Nouha Dziri, Xiang Ren & Yejin Choi
This paper presents CULTURE-GEN, a dataset of model generations on 8 culture-related topics for 110 countries and regions, produced by three state-of-the-art language models: GPT-4, Llama2-13B, and Mistral-7B. The dataset is built by prompting each model with natural-language instructions to generate text on the cultural topics and then extracting culture symbols (entities the models associate with specific cultures) from the generated text.

Analyzing these symbols, the authors find that the models favor Western cultures and use distinct linguistic markers that set marginalized cultures apart from "default" cultures. The diversity of culture symbols also varies across geographic regions, and culture symbols appear unevenly in culture-agnostic generations.

The paper highlights the importance of studying the fairness and depth of global cultural perception in large language models and suggests future research directions, including the use of open-source models with open training data and analysis of how individual training components shape cultural perception. The findings indicate that current language models hold uneven cultural perceptions and inadequate cultural knowledge, particularly of marginalized cultures, and that mitigating these biases requires broadening pretraining and instruction-tuning data to cover global cultures and adopting pluralistic alignment approaches. The authors also discuss limitations, including the focus on English-language cultural generations and the untested effect of multilingual training on cultural relevance. The dataset and code are publicly available for further research.
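The generation-and-extraction pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the topic and country lists are small placeholder subsets (the paper covers 8 topics and 110 countries/regions), the prompt template is paraphrased rather than the paper's exact wording, `fake_generate` stands in for a real LLM call, and the regex-based symbol extractor is a toy substitute for the paper's extraction procedure.

```python
import re

# Illustrative subsets; the paper uses 8 topics and 110 countries/regions.
TOPICS = ["food", "clothing", "music"]
COUNTRIES = ["Japan", "Nigeria", "Brazil"]

def build_prompt(topic: str, country: str) -> str:
    """Natural-language prompt in the spirit of the paper's templates (paraphrased)."""
    return f"Describe the {topic} of {country}."

def extract_symbols(generation: str) -> list[str]:
    """Toy extractor: treat capitalized tokens as candidate culture symbols.
    The paper's extraction is more sophisticated; this regex is only illustrative."""
    return re.findall(r"\b[A-Z][a-z]+\b", generation)

def fake_generate(prompt: str) -> str:
    """Stub standing in for a call to GPT-4, Llama2-13B, or Mistral-7B."""
    return "Sushi and Ramen are iconic dishes."

# Collect per-(topic, country) culture symbols from the generations.
symbols = {}
for topic in TOPICS:
    for country in COUNTRIES:
        text = fake_generate(build_prompt(topic, country))
        symbols[(topic, country)] = extract_symbols(text)

print(symbols[("food", "Japan")])  # → ['Sushi', 'Ramen']
```

With per-culture symbol sets in hand, the paper's analyses (symbol diversity by region, overlap with culture-agnostic generations) reduce to set operations over this dictionary.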