16 Feb 2024 | Runcong Zhao, Qinglin Zhu, Hainiu Xu, Jiazheng Li, Yuxiang Zhou, Yulan He, Lin Gui
The paper introduces a new benchmark, *Conan*, designed to evaluate large language models (LLMs) in understanding complex character relationships in detective narratives. The *Conan* dataset includes detective narratives from various characters' perspectives, with manually extracted and annotated role-oriented relationships, encompassing public, secret, and inferred relationships. The authors highlight the limitations of advanced LLMs like GPT-3.5, GPT-4, and Llama2 in inferring complex relationships and handling long narratives. Their experiments reveal that these models struggle due to the complexity of information and the length of narratives.

The paper outlines three sub-tasks: character extraction, entity linking, and relation deduction, and evaluates LLMs using three strategies: AllTogether, DirRelation, and PairRelation. The results show that while GPT-4 performs well on single-character perspectives, it struggles with extracting relationships from all characters' perspectives. The paper also discusses the impact of character extraction quality and different relation detection strategies on performance.

Finally, the authors propose future directions, including enhancing inferential abilities and optimizing key information management, and highlight potential applications in narrative understanding, interactive agents, and theory of mind tasks.
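In practice, benchmarks like this score a model by comparing its predicted relations against the gold annotations. Below is a minimal sketch of set-based precision/recall/F1 over relation triples; the `(head, relation, tail)` format, the character names, and the assumption that *Conan* uses exactly this metric are all hypothetical, included only to illustrate the kind of scoring such an evaluation involves.

```python
# Hypothetical sketch: scoring predicted relation triples against gold
# annotations with set-based precision, recall, and F1. The triple format
# and the metric choice are assumptions, not taken from the paper.

def triple_f1(predicted, gold):
    """Return (precision, recall, f1) for two collections of relation triples."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # triples the model recovered exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative character names (not from the dataset)
pred = [("Holmes", "colleague", "Watson"), ("Adler", "adversary", "Holmes")]
gold = [("Holmes", "colleague", "Watson"), ("Moriarty", "enemy", "Holmes")]
print(triple_f1(pred, gold))  # (0.5, 0.5, 0.5)
```

Exact-match scoring like this is why the entity-linking sub-task matters: if the model writes "Sherlock" where the gold annotation says "Holmes", the triple counts as wrong unless aliases are resolved first.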