SocialBench: Sociality Evaluation of Role-Playing Conversational Agents


5 Aug 2024 | Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, Jingren Zhou
This paper introduces SocialBench, a benchmark designed to systematically evaluate the social intelligence of role-playing conversational agents at both the individual and group levels. The benchmark covers 500 characters, over 6,000 questions, and 30,800 multi-turn role-playing utterances. It assesses several dimensions of social interaction, including self-awareness of the role description, emotional perception, long-term conversation memory, and social preference toward group dynamics.

**Key Findings:**
- Agents that excel in individual-level tasks do not necessarily perform well in group-level tasks.
- An agent's behavior can drift under the influence of the other agents in a group.
- These results establish the benchmark as a meaningful testbed for evaluating the social interaction of role-playing conversational agents.

**Dataset Construction:**
- **Profile Collection:** Diverse role profiles are collected from novels, scripts, and online platforms, and are also generated automatically with GPT-4-Turbo.
- **Dialogue Construction:** Dialogues are built in four ways: extracted from novels and scripts, collected from online platforms, produced in role-playing sessions between users and LLMs, and generated as LLM self-dialogues.
- **Question Design:** Questions cover self-awareness, emotional perception, conversation memory, and social preference, in both multiple-choice and open-domain generation formats; a minimal sketch of how such an item might be represented follows this list.
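The sketch below illustrates one way a multiple-choice item of this kind could be represented and turned into a prompt. The field names, prompt wording, and `SocialBenchItem` / `build_prompt` helpers are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass


@dataclass
class SocialBenchItem:
    """A hypothetical multiple-choice item in the style described above.

    Field names are illustrative assumptions, not the paper's released format.
    """
    role_profile: str    # character description the agent must stay in
    dialogue: list[str]  # multi-turn conversation context, one utterance per entry
    dimension: str       # e.g. "self-awareness", "emotion", "memory", "preference"
    question: str
    choices: list[str]
    answer_index: int    # index of the correct choice


def build_prompt(item: SocialBenchItem) -> str:
    """Assemble one evaluation prompt from profile, dialogue context, and question."""
    lines = ["You are playing the following character:", item.role_profile, "", "Conversation so far:"]
    lines += item.dialogue
    lines += ["", item.question]
    lines += [f"{chr(65 + i)}. {choice}" for i, choice in enumerate(item.choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```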
**Evaluation:**
- More than 10 mainstream LLMs, both open-source and closed-source, are evaluated on SocialBench (a toy per-dimension scoring loop is sketched at the end of this summary).
- Closed-source models generally outperform open-source ones, and models built specifically for role-playing also perform well.
- Most models, however, struggle with complex group dynamics and exhibit preference drift under different group polarities.

**Conclusion:** SocialBench provides a comprehensive framework for assessing the social intelligence of role-playing conversational agents. While agents perform satisfactorily at the individual level, their social interaction capabilities at the group level remain limited, and future research should address these limitations.
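To make the multiple-choice part of the evaluation setup concrete, here is a minimal scoring loop continuing the sketch above. The letter-extraction heuristic and per-dimension accuracy grouping are assumptions for illustration, not the paper's official evaluation harness; `model` stands in for any prompt-in, text-out LLM wrapper.

```python
from collections import defaultdict
from typing import Callable, Iterable


def evaluate(model: Callable[[str], str], items: Iterable[SocialBenchItem]) -> dict[str, float]:
    """Compute per-dimension multiple-choice accuracy for a prompt-in, text-out model."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        reply = model(build_prompt(item)).strip().upper()
        # Naive heuristic: treat the first capital letter in the reply as the chosen option.
        predicted = next((ch for ch in reply if "A" <= ch <= "Z"), "")
        gold = chr(65 + item.answer_index)
        correct[item.dimension] += int(predicted == gold)
        total[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```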