16 Feb 2024 | Runcong Zhao*, Qinglin Zhu*, Hainiu Xu, Jiazhen Li, Yuxiang Zhou, Yulan He, Lin Gui
This paper introduces Conan, a new benchmark for evaluating how well large language models (LLMs) understand complex character relationships in detective narratives. The dataset distinguishes three types of relationships: public relationships, which are widely known within the story; secret relationships, which are known to only a few characters; and inferred relationships, which must be deduced by combining information from multiple perspectives. Conan is constructed from detective narratives viewed through the perspectives of multiple characters, with relationships manually annotated from each viewpoint, making it a test of both the cognitive and the inferential abilities of LLMs in narrative contexts.

The paper evaluates advanced LLMs, including GPT-3.5, GPT-4, and Llama2, on this benchmark. The results show that these models struggle with complex relationships, particularly in longer narratives, exposing limitations in inferential reasoning and in extracting relevant information efficiently. The paper also examines the challenges of character extraction and relation extraction, and how different strategies affect performance. LLMs significantly underperform humans on Conan, and even GPT-4's performance degrades as the complexity of the input grows.

These findings offer insights for improving LLM capabilities in narrative comprehension and for building simulation and game agents. The paper concludes that Conan is a valuable resource for evaluating and improving LLMs' understanding of complex relationships in narratives.
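To make the three relationship categories concrete, here is a minimal Python sketch of what one multi-perspective annotation record might look like. The category names (public, secret, inferred) come from the paper; all class and field names, and the example characters, are hypothetical and the released dataset may use a different schema entirely.

```python
from dataclasses import dataclass
from enum import Enum

class RelationType(Enum):
    """The three relationship categories defined in Conan."""
    PUBLIC = "public"      # widely known within the narrative
    SECRET = "secret"      # known only to a few characters
    INFERRED = "inferred"  # deduced by combining multiple perspectives

@dataclass
class Relationship:
    """One annotated edge between two characters.

    Field names are illustrative, not the paper's actual schema.
    """
    source: str        # the character holding the relationship
    target: str        # the related character
    relation: str      # e.g. "accomplice", "sibling"
    rel_type: RelationType
    perspective: str   # the character from whose viewpoint this is annotated

# Invented example: from the detective's perspective, the butler
# secretly works with the victim's heir.
example = Relationship(
    source="Butler",
    target="Heir",
    relation="accomplice",
    rel_type=RelationType.SECRET,
    perspective="Detective",
)
```

Annotating the same pair of characters from several perspectives is what lets the benchmark separate public knowledge from secrets and inferences: the same edge can carry a different type depending on who is looking.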
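The summary does not reproduce the paper's scoring code, but relation extraction of this kind is commonly scored by comparing predicted (source, relation, target) triples against gold annotations. The sketch below shows one plausible micro-F1 computation under that assumption; the function name `triple_f1` and the toy triples are invented for illustration, and the paper's exact metric and matching rules may differ.

```python
def triple_f1(predicted: set[tuple[str, str, str]],
              gold: set[tuple[str, str, str]]) -> float:
    """Micro F1 over exact-match (source, relation, target) triples."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # triples found in both sets
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage with invented triples:
pred = {("Butler", "accomplice", "Heir"), ("Detective", "friend", "Doctor")}
gold = {("Butler", "accomplice", "Heir"), ("Heir", "nephew", "Victim")}
print(f"F1 = {triple_f1(pred, gold):.2f}")  # F1 = 0.50
```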