19 Jun 2024 | Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavana, Siddhant Singh, Ashutosh Dwivedi, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit Choudhury
This survey examines the representation and inclusion of culture in large language models (LLMs) through an analysis of over 90 recent papers. The study highlights that most papers do not explicitly define "culture," instead using proxies such as demographic and semantic aspects to explore cultural representation. These proxies include factors like language, region, religion, and values, which are used to assess how LLMs handle cultural differences. The research identifies several gaps, including the lack of robust probing methods, the limited exploration of semantic domains and aboutness, and the absence of situated studies on the impact of cultural misrepresentation in LLM-based applications. The survey also emphasizes the need for more interdisciplinary approaches and multilingual datasets to better understand and address cultural biases in LLMs. The findings suggest that while some studies focus on values and norms, many aspects of cultural difference remain understudied. The study calls for a more explicit acknowledgment of the link between datasets and cultural aspects, as well as the development of more interpretable and robust methods for evaluating cultural representation in LLMs. The survey concludes that a comprehensive understanding of culture in LLMs requires a broader, more interdisciplinary approach that considers the complex and multifaceted nature of cultural representation.