10 Apr 2024 | Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi
CulturalTeaming is an AI-assisted, interactive red-teaming system for building challenging evaluation datasets that probe the multicultural knowledge of large language models (LLMs). The system pairs human annotators with an LLM to generate culturally grounded questions that test models' understanding of diverse cultural norms. Through a series of one-hour workshops, it collects user-generated questions and feedback to build the CULTURALBENCH-V0.1 dataset: 252 carefully reviewed multiple-choice questions spanning 34 distinct cultures. The dataset reveals significant gaps in LLMs' multicultural proficiency, with accuracy ranging from 37.7% to 72.2% across LLM families.
The system offers two variants: Verifier-Only and AI-Assisted. In the Verifier-Only variant, users create and revise questions manually; in the AI-Assisted variant, an LLM drafts candidate questions and supplies revision hints, and users iteratively revise a question until it stumps the target model. Users with AI assistance produce more challenging questions and report higher levels of creativity and satisfaction.
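The AI-Assisted workflow described above can be sketched as a draft-test-revise loop. This is a minimal, hypothetical illustration, not the paper's implementation: the three helper functions stand in for real LLM calls, and the "revision" step simply appends the hint to the question text.

```python
# Hypothetical sketch of the AI-Assisted red-teaming loop: an LLM drafts a
# culture-specific multiple-choice question, the target model attempts it,
# and the user revises with LLM-generated hints until the model is stumped.
# All names and the stubbed model behavior below are illustrative assumptions.

def draft_question(topic):
    """Stub for an LLM call that drafts a culture-specific MCQ."""
    return {
        "question": f"Which practice is customary in {topic}?",
        "choices": ["A", "B", "C", "D"],
        "answer": "B",
    }

def target_model_answer(q):
    """Stub for the target LLM: answers correctly until the question is revised."""
    return q["answer"] if "(" not in q["question"] else "A"

def revision_hint(q):
    """Stub for an LLM call suggesting how to make the question harder."""
    return "add a plausible distractor grounded in a neighboring culture"

def red_team(topic, max_rounds=3):
    """Revise the question until the target model answers incorrectly."""
    q = draft_question(topic)
    for round_no in range(max_rounds):
        if target_model_answer(q) != q["answer"]:
            return q, round_no  # question stumps the model: keep it
        hint = revision_hint(q)
        q["question"] += f" ({hint})"  # user applies the hint (simplified)
    return q, max_rounds
```

In a real harness, `target_model_answer` would prompt the model under test and parse its chosen option, and the revision step would be performed by the human annotator guided by the hint.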
The CULTURALBENCH-V0.1 results show that LLMs struggle with culturally nuanced questions, particularly those requiring deeper reasoning about cultural context. The findings highlight the importance of incorporating cultural knowledge into LLM evaluations and the potential of AI-assisted methods for building more effective benchmarks. Users with AI assistance were also more engaged and produced higher-quality data, pointing to the value of human-AI collaboration in data annotation and underscoring the need for further research into culturally aware evaluation and more comprehensive multicultural benchmarks.
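The accuracy figures reported above (37.7% to 72.2%) come from scoring models on multiple-choice questions. A minimal sketch of that scoring, with an inline toy dataset and a stubbed model in place of real CULTURALBENCH-V0.1 items and a real LLM:

```python
# Hypothetical sketch of multiple-choice accuracy scoring. The two sample
# items and the model stub are illustrative assumptions, not real benchmark
# data or a real evaluation harness.

SAMPLE_ITEMS = [
    {"question": "...", "choices": ["A", "B", "C", "D"], "answer": "C"},
    {"question": "...", "choices": ["A", "B", "C", "D"], "answer": "A"},
]

def model_pick(item):
    """Stub for the LLM under evaluation; a real harness would prompt the model
    with the question and choices, then parse its selected option."""
    return "A"  # this stub always picks option A

def accuracy(items, pick):
    """Fraction of items where the model's pick matches the gold answer."""
    correct = sum(1 for it in items if pick(it) == it["answer"])
    return correct / len(items)
```

With this stub, `accuracy(SAMPLE_ITEMS, model_pick)` scores one of the two items correct; swapping in real model calls and the 252 benchmark items would reproduce the kind of per-model accuracies the dataset reports.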