CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies

23 Apr 2024 | Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Chunhua Yu, Raya Horesh, Rogério Abreu de Paula, Diyi Yang
The paper introduces *CultureBank*, a knowledge base built from online communities to make language models more culturally aware. The authors develop a pipeline that extracts and processes cultural descriptors from TikTok and Reddit, yielding 12K descriptors from TikTok and 11K from Reddit. Each descriptor comes with a grounded scenario, a persona, and a question for evaluation. The pipeline addresses limitations of existing cultural knowledge resources by capturing diverse views and providing contextualized scenarios. Experiments show that fine-tuning language models on *CultureBank* improves their performance on cultural tasks, which supports the effectiveness of the pipeline. The paper also offers recommendations for future culturally aware language technologies, emphasizing diverse data sources, multifaceted data content, and ethical handling of cultural data.