Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current agent benchmarks are limited to single-session evaluations, making them inadequate for assessing multi-session interactions that require long-term memory and personalized understanding. This work proposes Momento, a benchmark tailored to multi-session service scenarios that combines cross-session user states, evolving goals, and temporal dependencies, so that agents must both use and re-verify historical information during tool use and reasoning. The benchmark exposes a critical limitation of existing agents: they misestimate user state by treating prior session history as a reliable proxy for the current context instead of as potentially stale information that needs re-validation. Experimental results show that current agents fail on such tasks primarily for this reason, highlighting a substantial gap between current capabilities and the demands of realistic, long-horizon human-agent interaction.

📝 Abstract

Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.
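The failure mode named in the abstract, trusting prior-session history instead of re-validating it, can be illustrated with a small sketch. This is not Momento's implementation or evaluation harness; `MemoryEntry`, `resolve_state`, and the `fetch_current` callback (a stand-in for a tool call that reads live state) are hypothetical names, and the example data is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """A fact recorded in some session (hypothetical structure)."""
    key: str
    value: str
    session: int


def resolve_state(memory, fetch_current, current_session):
    """Build the agent's view of user state, re-checking older facts.

    Facts recorded before `current_session` are treated as possibly stale
    and re-read through `fetch_current` (assumed to be a tool call).
    Returns the resolved state and the keys whose remembered value was stale.
    """
    state, stale = {}, []
    for entry in memory:
        if entry.session < current_session:
            live = fetch_current(entry.key)
            if live != entry.value:
                stale.append(entry.key)
            state[entry.key] = live
        else:
            # Recorded in the current session, so it is used as-is.
            state[entry.key] = entry.value
    return state, stale


# Illustrative data: the address changed between sessions 1 and 3.
memory = [
    MemoryEntry("shipping_address", "12 Oak St", session=1),
    MemoryEntry("seat_preference", "aisle", session=1),
]
live_backend = {"shipping_address": "98 Pine Ave", "seat_preference": "aisle"}

state, stale = resolve_state(memory, live_backend.get, current_session=3)
print(state)
print(stale)
```

An agent that skipped the re-check would act on the remembered address from session 1; the sketch instead flags `shipping_address` as stale before any consequential tool action is taken.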

Problem

Research questions and friction points this paper is trying to address.

persistent memory

multi-session conversations

agentic AI

temporal dependencies

user state estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

persistent memory

multi-session reasoning

agentic AI