10 Apr 2024 | Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi
**CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge**
**Abstract:**
Large language models (LLMs) often lack multicultural knowledge because they are developed by researchers and practitioners with skewed cultural backgrounds. Current methods for assessing LLMs' multicultural knowledge, such as human annotation or scraping often-outdated internet resources, struggle to capture the complexity and diversity of cultural norms. To address this, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build challenging evaluation datasets for assessing LLMs' multicultural knowledge. Our study reveals that CulturalTeaming's AI assistance helps annotators create more difficult and creative cultural questions, yielding CULTURALBENCH-V0.1, a dataset of 252 carefully reviewed questions spanning 34 distinct cultures. The dataset exposes a notable gap in LLMs' multicultural proficiency, with model accuracies ranging from 37.7% to 72.2%.
**Introduction:**
The uneven cultural representation in LLMs poses a significant challenge for assessing their multicultural knowledge. Traditional human-written benchmarks are static and fail to keep pace with LLMs' evolving capabilities, while automatic benchmark creation from online resources or socio-cultural surveys struggles to capture the complexity and diversity of cultural norms. We introduce CulturalTeaming, a system that guides users in creating challenging evaluation datasets through human-AI collaboration. We implement two variants, Verifier-Only and AI-Assisted, with the latter providing more extensive AI assistance. User studies show that AI assistance helps annotators write harder questions and boosts their creativity.
**System Overview:**
CulturalTeaming consists of three steps: Question Formulation, Question Verification and Revision, and Feedback Collection. Users brainstorm culturally relevant scenarios, draft multiple-choice questions (MCQs), revise them iteratively based on verifier feedback, and provide feedback on the process. The system supports different levels of AI assistance, from minimal to extensive, to reduce annotators' cognitive load and improve data quality.
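To make the verify-and-revise step concrete, below is a minimal Python sketch of how such a red-teaming gate could work: a drafted MCQ is only kept once the target model answers it incorrectly, otherwise it goes back to the annotator for revision. This is an illustration rather than the authors' implementation; the `MCQ` fields, `model_answer`, and `revise_fn` are hypothetical placeholders for the question format, the target-LLM call, and the annotator's (optionally AI-assisted) revision step.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MCQ:
    """A multiple-choice question drafted by an annotator (hypothetical format)."""
    scenario: str
    options: list[str]
    answer_idx: int      # index of the human-verified correct option
    revisions: int = 0   # number of annotator revisions so far

def model_answer(mcq: MCQ) -> int:
    """Stub for a call to the target LLM: prompt it with the scenario and
    options, and return the index of the option it picks."""
    return 0  # placeholder: always picks the first option

def verify_and_revise(mcq: MCQ,
                      revise_fn: Callable[[MCQ], MCQ],
                      max_rounds: int = 3) -> Optional[MCQ]:
    """Keep a question only once the target model answers it incorrectly;
    otherwise hand it back to the annotator (revise_fn) to make it harder,
    for at most max_rounds revision rounds."""
    for _ in range(max_rounds):
        if model_answer(mcq) != mcq.answer_idx:
            return mcq           # model is fooled: question is challenging enough
        mcq = revise_fn(mcq)     # annotator revises, optionally with AI hints
        mcq.revisions += 1
    return None                  # still too easy after max_rounds: discard
```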
**Results:**
Our dataset, CULTURALBENCH-V0.1, includes 252 carefully reviewed MCQs spanning 34 cultures. Model accuracies range from 37.7% to 72.2%, revealing a significant gap in LLMs' multicultural proficiency. Qualitative analysis shows that the hardest questions demand extensive reasoning and hinge on subtle cultural distinctions.
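For context, scoring a model on such an MCQ benchmark reduces to exact-match accuracy over the human-verified answers. A minimal sketch, reusing the hypothetical `MCQ` format from the verification example above (the per-question `answer_fn` callable is likewise an assumption, standing in for a real model call):

```python
from typing import Callable, Iterable

def accuracy(answer_fn: Callable[[object], int], questions: Iterable) -> float:
    """Exact-match accuracy: fraction of MCQs where the model's chosen option
    index matches the human-verified answer_idx."""
    questions = list(questions)
    correct = sum(answer_fn(q) == q.answer_idx for q in questions)
    return correct / len(questions)

# Hypothetical usage: one accuracy score per evaluated model.
# scores = {name: accuracy(fn, culturalbench_v01) for name, fn in model_fns.items()}
```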
**Conclusion:**
CulturalTeaming effectively collects challenging datasets for assessing LLMs' multicultural knowledge. Our findings suggest that human-AI collaboration can enhance the creation of high-quality, challenging datasets, improving the evaluation of LLMs' cultural awareness.