7 May 2024 | Siqi Shen, Lajanugen Logeswaran, Soujanya Poria, Moontae Lee, Honglak Lee, Rada Mihalcea
This paper examines the capabilities and limitations of large language models (LLMs) in understanding cultural commonsense. The authors conduct a comprehensive evaluation across several benchmarks and find that LLMs exhibit significant performance discrepancies when tested on culture-specific commonsense knowledge for different cultures. They also observe that LLMs' general commonsense capability is influenced by cultural context, and that the language used to query the models can affect their performance on culture-related tasks. The study highlights inherent biases in LLMs' cultural understanding and provides insights for developing more culturally aware language models. Key findings include:
1. **Performance Discrepancies**: LLMs perform poorly on questions about Iran and Kenya, indicating a lack of familiarity with these cultures.
2. **Cultural Context Impact**: LLMs tend to associate general commonsense with dominant cultures, such as the United States, and struggle to verify general commonsense statements when they are framed in a specific cultural context.
3. **Language Influence**: The language used to query LLMs significantly affects their performance, with English generally yielding the highest accuracy and other languages causing accuracy drops of up to 20%.
4. **Multilingual Prompting**: Querying in multiple languages can improve performance, but the native language of a culture does not necessarily help; a minimal sketch of this idea follows the list.
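The paper does not prescribe a specific aggregation scheme for multilingual prompting, so the following is only a minimal sketch: the same question is posed in several languages through an OpenAI-style chat API, and the answers are combined by majority vote. The model name, the example question, the language set, and the voting logic are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from openai import OpenAI  # assumes the `openai` Python package is installed

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical translations of one cultural-commonsense question.
# The question, wording, and language set are illustrative, not from the paper.
questions = {
    "en": "In Iran, is tea commonly served to guests?",
    "fa": "آیا در ایران معمولاً برای مهمانان چای سرو می‌شود؟",
    "zh": "在伊朗，通常会给客人端茶吗？",
}

def ask(question: str) -> str:
    """Query the model once and normalize the answer to an English yes/no."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{
            "role": "user",
            # Ask for a one-word English answer so votes are comparable
            # across prompt languages.
            "content": f"{question}\nAnswer with exactly one English word: yes or no.",
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Majority vote across languages: one simple way to combine multilingual
# answers into a single prediction.
answers = [ask(q) for q in questions.values()]
majority, _ = Counter(answers).most_common(1)[0]
print(f"Per-language answers: {answers}\nMajority answer: {majority}")
```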
The paper offers suggestions for improving LLMs' cultural awareness, including curating more diverse training data and using techniques like Chain of Thought or self-feedback to address cultural biases.
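To make the Chain-of-Thought suggestion concrete, one possible prompt shape elicits culture-specific reasoning before the final answer. The template below is a hypothetical sketch; the paper recommends CoT as a mitigation but does not specify this wording.

```python
# A minimal Chain-of-Thought prompt for a culture-specific question.
# The template text, culture, and question are assumptions for illustration.
COT_TEMPLATE = (
    "Question about {culture}: {question}\n"
    "First, recall customs and practices specific to {culture}. "
    "Then reason step by step, and end with 'Answer: yes' or 'Answer: no'."
)

prompt = COT_TEMPLATE.format(
    culture="Kenya",
    question="Is it customary to greet elders before starting a conversation?",
)
print(prompt)
```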