CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

9 Jan 2024 | Quan Tu*, Shilong Fan*, Zihang Tian*, Rui Yan*
The paper introduces *CharacterEval*, a comprehensive benchmark for evaluating Role-Playing Conversational Agents (RPCAs) in Chinese. The benchmark includes a high-quality dataset of 1,785 multi-turn role-playing dialogues featuring 77 characters from Chinese novels and scripts. The dataset was constructed through a rigorous process involving GPT-4 for dialogue extraction, human-led quality control, and detailed character profiles sourced from Baidu Baike. *CharacterEval* employs a multifaceted evaluation approach with thirteen specific metrics across four dimensions: conversational ability, character consistency, role-playing attractiveness, and personality back-testing. To facilitate the evaluation of subjective metrics, the authors developed CharacterRM, a role-playing reward model trained on human annotations, which shows higher correlation with human judgment than GPT-4. Experimental results demonstrate that Chinese LLMs outperform GPT-4 in Chinese role-playing conversations, highlighting the potential of Chinese LLMs in this domain. The paper also discusses related work, problem formulation, data collection, evaluation metrics, and detailed experimental results, concluding with a robustness analysis and a discussion of limitations and future directions.
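
As a rough illustration of how the claim "higher correlation with human judgment" could be checked, the sketch below compares hypothetical automatic scores against human annotations using Pearson correlation. This is not the paper's evaluation code; the metric choice, score scale, and all values are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not from the paper): measure how well two
# automatic judges track human ratings on one subjective metric,
# e.g. character consistency on a 1-5 scale.
from scipy.stats import pearsonr

human_scores        = [4, 5, 3, 2, 5, 4, 3]  # hypothetical human annotations
character_rm_scores = [4, 5, 3, 3, 5, 4, 3]  # hypothetical reward-model scores
gpt4_scores         = [3, 5, 4, 2, 4, 5, 3]  # hypothetical GPT-4-as-judge scores

for name, scores in [("CharacterRM", character_rm_scores), ("GPT-4", gpt4_scores)]:
    r, _ = pearsonr(human_scores, scores)  # correlation with human judgment
    print(f"{name} vs. human: Pearson r = {r:.3f}")
```

A higher correlation for the reward model than for GPT-4 on such a comparison is the kind of evidence the authors use to justify scoring subjective metrics with CharacterRM.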