CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

9 Jan 2024 | Quan Tu, Shilong Fan, Zihang Tian, Rui Yan
CharacterEval is a Chinese benchmark for evaluating Role-Playing Conversational Agents (RPCAs). It includes a high-quality dataset of 1,785 multi-turn role-playing dialogues, comprising 11,376 examples and 77 characters drawn from Chinese novels and scripts. The dataset was constructed by using GPT-4 for initial dialogue extraction, followed by human-led quality control, and is enriched with character profiles sourced from Baidu Baike.

CharacterEval employs a multifaceted evaluation approach with thirteen targeted metrics across four dimensions: conversational ability, character consistency, role-playing attractiveness, and personality back-testing. Twelve of the metrics fall under the first three dimensions and are subjective in nature; to score them automatically, the authors developed CharacterRM, a role-playing reward model that correlates more strongly with human judgment than GPT-4 does.

Comprehensive experiments on CharacterEval show that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation, where GPT-4 performs poorly. The results also underline the importance of character consistency and role-playing attractiveness when evaluating RPCAs, and suggest that future RPCA development should focus on longer conversational scenarios to ensure more stable and consistent role-playing interactions. The dataset, reward model, and source code are publicly available at https://github.com/morecry/CharacterEval.
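As a rough illustration of how the "correlation with human judgment" comparison could be reproduced, the sketch below computes Pearson and Spearman correlations between each automatic judge's scores and human annotations. The file names, record fields, and loading helper are hypothetical placeholders, not the repository's actual data format or API.

import json
from scipy.stats import pearsonr, spearmanr

def load_scores(path):
    """Load a JSON list of {"example_id": ..., "metric": ..., "score": ...} records.

    This schema is an illustrative assumption; see the CharacterEval repository
    for the real data layout.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Key each score by (example id, metric name) so the score sources can be aligned.
    return {(r["example_id"], r["metric"]): float(r["score"]) for r in records}

human = load_scores("human_annotations.json")           # hypothetical file
character_rm = load_scores("character_rm_scores.json")  # hypothetical file
gpt4 = load_scores("gpt4_scores.json")                  # hypothetical file

# Restrict the comparison to examples and metrics scored by all three sources.
keys = sorted(set(human) & set(character_rm) & set(gpt4))
gold = [human[k] for k in keys]

for name, scores in [("CharacterRM", character_rm), ("GPT-4", gpt4)]:
    preds = [scores[k] for k in keys]
    p, _ = pearsonr(gold, preds)
    s, _ = spearmanr(gold, preds)
    print(f"{name}: Pearson={p:.3f}, Spearman={s:.3f}")

A higher correlation for CharacterRM than for GPT-4 on such a comparison would support the paper's claim that the dedicated reward model is a better automatic judge of the twelve subjective metrics.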