27 Feb 2024 | Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang
This paper introduces LoCoMo, a new dataset of very long-term conversations, each containing 300 turns and 9K tokens on average, spread over up to 35 sessions. The conversations are produced by a human-machine pipeline: LLM-based virtual agents, each equipped with a unique persona, a temporal event graph, and memory and reflection modules, converse across sessions, and human annotators then edit the transcripts for consistency and for grounding to the event graphs (a sketch of the agent-side structures follows below).

On top of the dataset, the paper defines a benchmark for measuring long-term memory in models, with three tasks: question answering, event summarization, and multi-modal dialogue generation.

The results show that LLMs struggle to understand lengthy conversations and to comprehend long-range temporal and causal dynamics within them; they are particularly weak at reasoning about time and at open-domain knowledge questions grounded in the dialogue. Strategies such as long-context LLMs and retrieval-augmented generation (RAG) offer improvements, but both still lag substantially behind human performance.

The paper also discusses the dataset's limitations, including its hybrid human-machine generation and its limited exploration of multimodal behavior, and concludes that current LLMs fail to comprehend long-term narratives within a dialogue or to draw temporal and causal connections between the events speakers discuss.
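The summary above does not include reference code for the generative pipeline, but the components it names (a persona, a temporal event graph, and per-agent memory and reflection) map naturally onto a few data structures. Below is a minimal, hypothetical Python sketch of that shape; the class and method names (`Event`, `Agent`, `reflect`, `next_turn`) are my own, and the LLM calls the actual pipeline relies on are replaced with runnable placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class Event:
    """A node in the temporal event graph: a dated life event with causal parents."""
    when: date
    description: str
    causes: List["Event"] = field(default_factory=list)  # earlier events that led here

@dataclass
class Agent:
    """A virtual speaker with a persona, an event graph, and a running memory."""
    name: str
    persona: str
    events: List[Event]
    memory: List[str] = field(default_factory=list)  # condensed notes carried across sessions

    def reflect(self, session_transcript: List[str]) -> None:
        # Placeholder for the reflection step: in the paper this is an LLM call that
        # distills a session into persistent memories. Here we store raw turns so the
        # sketch runs without an API key.
        self.memory.extend(session_transcript)

    def next_turn(self, partner_utterance: str, today: date) -> str:
        # Placeholder for generation: a real agent would condition an LLM on the
        # persona, memory, and the events that have occurred by the session date.
        occurred = [e.description for e in self.events if e.when <= today]
        return (f"[{self.name} | persona={self.persona!r} | "
                f"events so far={len(occurred)} | replying to: {partner_utterance!r}]")

# Hypothetical usage: two agents exchange turns in one dated session.
alice = Agent("Alice", "a trail runner training for her first marathon",
              [Event(date(2023, 5, 1), "signed up for a marathon")])
bob = Agent("Bob", "a photographer who recently adopted a rescue dog",
            [Event(date(2023, 5, 10), "adopted a rescue dog named Pixel")])

session_date = date(2023, 5, 12)
turns: List[str] = []
utterance = "Hi Alice, how has training been going?"
for _ in range(3):  # a short session; LoCoMo sessions run much longer
    utterance = alice.next_turn(utterance, session_date)
    turns.append(utterance)
    utterance = bob.next_turn(utterance, session_date)
    turns.append(utterance)
alice.reflect(turns)
bob.reflect(turns)
print("\n".join(turns))
```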
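The paper reports that RAG narrows, but does not close, the gap with human performance on question answering. As a rough illustration of what a retrieval baseline over conversation history looks like, here is a self-contained bag-of-words scorer; this is a toy stand-in, not the paper's method, which would use a dense retriever over turns or session summaries, and all names and the example history below are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def _bow(text: str) -> Counter:
    """Lowercased bag-of-words representation of a string."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(question: str, turns: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Score every stored turn against the question and keep the top-k."""
    q = _bow(question)
    scored = sorted(((_cosine(q, _bow(t)), t) for t in turns), reverse=True)
    return scored[:k]

# Hypothetical usage: the retrieved turns would be prepended to the QA prompt.
history = [
    "S1 Alice: I signed up for my first marathon in October!",
    "S2 Bob: Pixel chewed through another camera strap today.",
    "S3 Alice: My knee has been acting up, so I switched to cycling this week.",
]
for score, turn in retrieve("Why did Alice stop running?", history):
    print(f"{score:.2f}  {turn}")
```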