Evaluating Very Long-Term Conversational Memory of LLM Agents

27 Feb 2024 | Adyasha Maharana, Mohit Bansal, Dong-Ho Lee, Francesco Barbieri, Sergey Tulyakov, Yuwei Fang
This paper addresses the gap in evaluating long-term open-domain dialogue by introducing a machine-human pipeline for generating high-quality, very long-term dialogues. The pipeline uses LLM-based agent architectures, grounds conversations in personas and temporal event graphs, and equips the agents to share and react to images. Human annotators then verify and edit the generated conversations for consistency and grounding to the event graphs. The resulting dataset, LoCoMo, consists of 50 very long-term dialogues, each averaging 300 turns and 9K tokens across up to 35 sessions. The paper also presents a comprehensive evaluation benchmark for measuring long-term memory in models, comprising question answering, event summarization, and multimodal dialogue generation tasks. Experimental results show that LLMs struggle to understand lengthy conversations and to comprehend long-range temporal and causal dynamics, despite gains from long-context LLMs and retrieval-augmented generation (RAG) techniques. The study highlights the need for further research to improve model performance on very long-term dialogues.
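
To make the RAG setup concrete, below is a minimal sketch of retrieval-augmented QA over a multi-session conversation. The session data, speaker labels, and bag-of-words scoring here are hypothetical illustrations, not the paper's actual retriever or the LoCoMo data; they only show the general idea of retrieving relevant turns instead of feeding the full 9K-token history to the model.

# Minimal sketch: retrieve dialogue turns relevant to a question, then pass
# only those turns to an LLM as context. All data below is hypothetical.
from collections import Counter
import math

def bow(text: str) -> Counter:
    """Bag-of-words vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical multi-session conversation history: (speaker, utterance).
sessions = {
    "session_1": [("A", "I adopted a puppy named Milo last week."),
                  ("B", "Congrats! What breed is Milo?")],
    "session_20": [("A", "Milo just finished obedience school."),
                   ("B", "He has grown up so fast!")],
}

def retrieve(question: str, k: int = 2):
    """Rank utterances across all sessions by similarity to the question."""
    q = bow(question)
    scored = [(cosine(q, bow(utt)), sess, speaker, utt)
              for sess, turns in sessions.items()
              for speaker, utt in turns]
    return sorted(scored, reverse=True)[:k]

# The top-k turns would be inserted into the LLM prompt as context.
for score, sess, speaker, utt in retrieve("What is the name of A's dog?"):
    print(f"{score:.2f} [{sess}] {speaker}: {utt}")

In practice, the bag-of-words scorer would be replaced by a dense retriever, but the structure is the same: a long conversation is treated as a corpus of turns (or sessions), and only the highest-scoring pieces reach the model's context window.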