The paper "CultureLLM: Incorporating Cultural Differences into Large Language Models" addresses the issue of cultural bias in large language models (LLMs), which are often dominated by English corpora and exhibit preferences for Western culture. To tackle this, the authors propose CultureLLM, a cost-effective solution that incorporates cultural differences into LLMs. CultureLLM uses the World Values Survey (WVS) as seed data and generates semantically equivalent training data through semantic data augmentation. This approach ensures that the augmented data retains the original ground-truth opinions from different cultures while introducing diversity in sentence and word styles.
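The core constraint of this augmentation step can be sketched in a few lines. The snippet below is a hypothetical illustration, not the paper's implementation: where CultureLLM uses an LLM to paraphrase WVS survey items, a simple synonym-substitution table stands in for that step. The `SYNONYMS` table, `augment` function, and seed example are all invented for illustration; the point being demonstrated is that the ground-truth answer is copied unchanged to every variant while only the question wording varies.

```python
# Hypothetical sketch of label-preserving semantic augmentation.
# CultureLLM paraphrases WVS questions with an LLM; a toy synonym
# table stands in for that paraphrasing step here.

# Invented synonym table for illustration only.
SYNONYMS = {
    "important": ["significant", "essential"],
    "family": ["household", "relatives"],
}

def augment(seed, max_variants=4):
    """Generate semantically equivalent variants of a survey item.

    The ground-truth answer is attached unchanged to every variant,
    mirroring the constraint that augmentation must preserve the
    original opinion label while diversifying the wording.
    """
    words = seed["question"].split()
    variants = []
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word.lower().strip(".,?"), []):
            new_words = words.copy()
            new_words[i] = synonym  # vary wording only
            variants.append({"question": " ".join(new_words),
                             "answer": seed["answer"]})  # label untouched
            if len(variants) >= max_variants:
                return variants
    return variants

seed = {"question": "How important is family in your life?",
        "answer": "Very important"}
samples = augment(seed)
```

Each generated sample differs from the seed in surface form but carries the identical answer, so fine-tuning on the augmented set cannot drift away from the culture's original survey responses.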
The authors fine-tune both culture-specific models and a unified model (CultureLLM-One) on 9 cultures, covering both high- and low-resource languages. Extensive experiments on 60 culture-related datasets show that CultureLLM significantly outperforms baseline models such as GPT-3.5 and Gemini Pro, with performance comparable to or even better than GPT-4. Human studies confirm that the generated samples are semantically equivalent to the originals, validating the effectiveness of the augmentation method.
Key contributions of the paper include:
1. A cost-effective solution to build culturally aware LLMs.
2. A semantic data augmentation approach to generate high-quality and diverse training data.
3. Extensive experiments demonstrating consistent performance across a wide range of cultures and LLMs.
The paper also discusses related work, including cultural bias in LLMs, data augmentation techniques, and value alignment. It concludes by highlighting the importance of recognizing and valuing cultural differences, emphasizing the need for inclusive technologies and services.