CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

20 Aug 2024 | Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen, Bill Yuchen Lin, Nouha Dziri, Xiang Ren & Yejin Choi
The paper explores the cultural perceptions of three state-of-the-art large language models (LLMs) across 110 countries and regions on 8 culture-related topics. The authors use culture-conditioned generations to extract symbols associated with each culture and analyze the diversity and fairness of these symbols. Key findings include:

1. **Cultural Markedness**: LLMs tend to use linguistic markers, such as the word "traditional" and parenthesized explanations, to distinguish marginalized cultures from default cultures. This markedness is more pronounced for certain geographic regions, particularly Eastern European, African-Islamic, and Middle Eastern cultures.
2. **Diversity of Culture Symbols**: The diversity of culture symbols varies significantly across geographic regions, indicating that some marginalized cultures are underrepresented in LLMs' knowledge. Symbol diversity is moderately to strongly correlated with how often culture names co-occur with topic-related keywords in the training data.
3. **Default Culture Symbols**: In culture-agnostic generations, LLMs show a higher presence of West European, English-speaking, and Nordic cultures, suggesting a bias toward these regions.
4. **Training Data Impact**: How frequently a culture appears in training data significantly affects the diversity of its culture symbols; models trained on more diverse data tend to generate a wider range of culture symbols.
5. **Future Directions**: The authors suggest that improving cultural fairness in LLMs requires expanding the coverage of pretraining and instruction-tuning data to include global cultures and implementing pluralistic alignment techniques.
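The pipeline described above (culture-conditioned prompting, symbol extraction, markedness cues, and a diversity measure) can be sketched roughly as follows. The prompt template, the token-level symbol extraction, and all function names are illustrative assumptions, not the authors' exact implementation:

```python
import re

# Minimal sketch of a CULTURE-GEN-style analysis, assuming a simple prompt
# template and a toy token-based notion of a "culture symbol".

TOPICS = ["food", "clothing", "music"]  # illustrative subset of the 8 topics


def culture_prompt(culture: str, topic: str) -> str:
    """Build a culture-conditioned prompt (hypothetical template)."""
    return f"Describe a typical {topic} of {culture} people."


def extract_symbols(generation: str) -> set[str]:
    """Toy symbol extraction: treat lowercase alphabetic tokens as candidates."""
    return set(re.findall(r"[a-z]+", generation.lower()))


def is_marked(generation: str) -> bool:
    """Flag the markedness cues the paper reports: the word 'traditional'
    or a parenthesized explanation."""
    return "traditional" in generation.lower() or bool(re.search(r"\([^)]+\)", generation))


def symbol_diversity(generations_by_culture: dict[str, list[str]]) -> dict[str, int]:
    """Diversity proxy: count of distinct symbols generated per culture."""
    diversity = {}
    for culture, gens in generations_by_culture.items():
        symbols: set[str] = set()
        for g in gens:
            symbols |= extract_symbols(g)
        diversity[culture] = len(symbols)
    return diversity
```

In the actual study, the generations would come from the LLMs under evaluation, and per-culture diversity scores would then be compared across geographic regions and correlated with training-data co-occurrence counts.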
The study highlights the need for further research to address cultural biases and ensure fair and diverse cultural representation in LLMs.