TIMECHARA: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models


28 May 2024 | Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, Gunhee Kim
TIMECHARA is a new benchmark for evaluating point-in-time character hallucination in role-playing large language models (LLMs). It comprises 10,895 instances generated through an automated pipeline and reveals significant hallucination issues in current state-of-the-art LLMs such as GPT-4o. To address this, the paper proposes NARRATIVE-EXPERTS, a method that decomposes the reasoning steps and uses narrative experts to reduce point-in-time character hallucination.

The benchmark evaluates both the spatiotemporal consistency and the personality consistency of role-playing agents. The dataset is constructed by selecting four popular novel series and generating interview questions tailored to each character at a specific point in their story, together with spatiotemporal labels used to judge the consistency of the responses.

Experimental results show that even advanced LLMs struggle with point-in-time character hallucination, and that NARRATIVE-EXPERTS substantially reduces hallucination and improves spatiotemporal consistency. Even so, the findings indicate that point-in-time character hallucination remains an open challenge, suggesting the need for further improvements. The paper also discusses limitations, including cultural biases in the dataset and the high cost of GPT-4 judges, and addresses ethical concerns related to copyright and harmful content. Overall, the work contributes a new benchmark for role-playing LLMs and a method for reducing hallucination.
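To make the decomposition idea concrete, here is a minimal sketch of how a NARRATIVE-EXPERTS-style pipeline might separate reasoning into specialized expert prompts before the final role-play step. The expert roles (timeline, setting), function names, and prompt wording are illustrative assumptions, not taken from the paper; in a real system each expert prompt would be sent as a separate LLM call.

```python
# Illustrative sketch: split role-play reasoning across "expert" prompts
# (timeline and setting), then combine their findings into the final
# role-play instruction. All names and wording here are hypothetical.

from dataclasses import dataclass

@dataclass
class PointInTime:
    character: str   # e.g. "Harry Potter"
    moment: str      # narrative position, e.g. "the end of Book 2"

def temporal_expert_prompt(pit: PointInTime, question: str) -> str:
    """Ask whether the events referenced in the question have occurred yet."""
    return (
        f"As a narrative timeline expert for {pit.character}'s story, "
        f"determine whether the events referenced in the question below "
        f"have already occurred as of {pit.moment}.\nQuestion: {question}"
    )

def spatial_expert_prompt(pit: PointInTime) -> str:
    """Ask where the character is located at this point in the story."""
    return (
        f"As a narrative setting expert, state where {pit.character} is "
        f"located at {pit.moment}."
    )

def roleplay_prompt(pit: PointInTime, question: str,
                    temporal_fact: str, spatial_fact: str) -> str:
    """Combine the experts' findings into the final role-play instruction."""
    return (
        f"You are {pit.character} at {pit.moment}.\n"
        f"Known timeline fact: {temporal_fact}\n"
        f"Known location fact: {spatial_fact}\n"
        f"Answer in character, without knowledge of any later events.\n"
        f"Question: {question}"
    )

# Minimal wiring demo (expert answers would come from separate LLM calls).
pit = PointInTime("Harry Potter", "the end of Book 2")
q = "What do you think of Sirius Black?"
final = roleplay_prompt(
    pit, q,
    temporal_fact="Sirius Black has not yet been introduced.",
    spatial_fact="Harry is at Hogwarts.",
)
print(final)
```

The design point is that the final prompt only has to stay in character; the harder spatiotemporal reasoning has already been resolved by the upstream experts, which is the kind of decomposition the paper credits with reducing point-in-time hallucination.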