18 Jul 2024 | Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang, Alexander Spangher, Muhao Chen, Jonathan May, Nanyun Peng
This paper investigates the capability of large language models (LLMs) in storytelling, focusing on narrative development and plot progression. The authors introduce a novel computational framework to analyze narratives through three discourse-level aspects: story arcs, turning points, and affective dimensions (arousal and valence). By leveraging expert and automatic annotations, they uncover significant discrepancies between LLM-generated and human-written stories. Human stories are found to be more suspenseful, arousing, and diverse in narrative structures, while LLM stories are homogeneously positive and lack tension. The study also measures narrative reasoning skills as a precursor to generative capacities, concluding that most LLMs fall short of human abilities in discourse understanding. Finally, the authors demonstrate that explicit integration of discourse features can enhance storytelling, showing over 40% improvement in neural storytelling in terms of diversity, suspense, and arousal. The contributions of the paper include a unified framework for narrative analysis, a quantitative comparison of LLM and human generative capacities, and the demonstration that discourse-aware generation can significantly improve LLMs' storytelling abilities.
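
To make the affective-dimension analysis concrete, below is a minimal sketch (not the paper's implementation) of how valence and arousal could be traced across story segments with a lexicon lookup; the toy VAD_LEXICON, the neutral fallback of (0.5, 0.5), and the per-sentence averaging are illustrative assumptions, and a real analysis would rely on a full affect lexicon or a trained classifier.

# Minimal sketch (assumed approach, not the authors' code): trace valence/arousal
# across a story's sentences using a tiny, hypothetical VAD-style lexicon.
from statistics import mean

# Hypothetical toy lexicon: word -> (valence, arousal), both in [0, 1].
VAD_LEXICON = {
    "joy": (0.95, 0.60), "calm": (0.80, 0.15), "storm": (0.30, 0.85),
    "betrayal": (0.10, 0.75), "victory": (0.90, 0.80), "loss": (0.15, 0.55),
}

def segment_scores(sentences):
    """Average valence/arousal per sentence; default to neutral (0.5, 0.5) if no lexicon hits."""
    scores = []
    for sent in sentences:
        hits = [VAD_LEXICON[w] for w in sent.lower().split() if w in VAD_LEXICON]
        if hits:
            scores.append((mean(v for v, _ in hits), mean(a for _, a in hits)))
        else:
            scores.append((0.5, 0.5))
    return scores

story = [
    "A calm morning filled with joy",
    "Then the storm brought betrayal and loss",
    "At last came victory",
]
print(segment_scores(story))  # a rise-fall-rise valence pattern hints at a non-flat story arc

A trajectory like this is the kind of signal the paper's framework aggregates to compare story arcs and emotional tension between human-written and LLM-generated stories.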