SocialBench: Sociality Evaluation of Role-Playing Conversational Agents

5 Aug 2024 | Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, Jingren Zhou
SocialBench is the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels. It is constructed from diverse English and Chinese books, movies, and novels, covering 500 characters, 6,000 questions, and 30,800 multi-turn role-playing utterances. The benchmark spans dimensions such as self-awareness of role descriptions, emotional perception of the environment, long-term conversation memory, and social preference in group dynamics, and it is publicly available at https://github.com/X-PLUG/SocialBench.

The benchmark is built through a three-step pipeline: profile collection, dialogue construction, and question design. Profiles are gathered from novels, scripts, and online platforms, or generated automatically by prompting GPT-4-Turbo. Dialogues are constructed by extracting conversations from novels and scripts, collecting user dialogue data from online platforms, running role-playing sessions between users and general LLMs, and generating self-dialogues with general LLMs. Questions are then designed across the dimensions above: self-awareness, emotional perception, conversation memory, and social preference.

The dataset is validated through rigorous manual screening, annotation, and refinement, with dimension-specific validation strategies to ensure its quality and validity.

Evaluation covers mainstream open-source and closed-source LLMs, including LLaMA-2, Mistral-7B, Qwen, Minimax, GLM, Baichuan, and GPT-4-Turbo. Closed-source models generally outperform open-source ones, and models specifically designed for role-playing, such as Xingchen-Plus, perform better still. However, role-playing agents tend to underperform their general-purpose counterparts, especially in complex group dynamics: while they demonstrate satisfactory performance at the individual level, their social interaction capabilities at the group level remain deficient. SocialBench thus provides a comprehensive framework for assessing the sociality of role-playing conversational agents and highlights areas for future research and development.
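The paper reports accuracy on multiple-choice questions grouped by dimension. As a rough illustration only, the sketch below shows one way such items could be represented and scored per dimension; the field names and the agent interface are assumptions made for this example, not the benchmark's actual schema (see the repository linked above for the real data format and evaluation scripts).

```python
# Hypothetical sketch of scoring SocialBench-style multiple-choice items
# against a role-playing agent. Field names and the agent interface are
# illustrative assumptions, not the official schema.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SocialBenchItem:
    dimension: str        # e.g. "self_awareness", "emotion", "memory", "social_preference"
    role_profile: str     # character description the agent must stay consistent with
    dialogue: List[str]   # multi-turn conversation context
    question: str         # probe about the role, the environment, or the group
    choices: List[str]    # candidate answers
    answer_index: int     # index of the correct choice


def score(items: List[SocialBenchItem],
          agent: Callable[[SocialBenchItem], int]) -> Dict[str, float]:
    """Return per-dimension accuracy; `agent` maps an item to a chosen choice index."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        total[item.dimension] = total.get(item.dimension, 0) + 1
        if agent(item) == item.answer_index:
            correct[item.dimension] = correct.get(item.dimension, 0) + 1
    return {dim: correct.get(dim, 0) / n for dim, n in total.items()}
```

Grouping the resulting per-dimension accuracies into individual-level scores (self-awareness, emotional perception, memory) and group-level scores (social preference) would reproduce the kind of comparison the paper draws between individual and group sociality.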