S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

📅 2026-04-10

🤖 AI Summary

This work addresses the challenge of efficiently retrieving sparse evidence from heterogeneous interaction histories in long-horizon interactive question answering, where existing memory interfaces struggle to balance compactness and reasoning capability. The authors propose S3Mem, a query-time structured memory interface that encodes textual, visual, and agent histories into scene-event units. A multi-granularity routing mechanism, operating at the levels of units, query anchors, and anchor-support links, enables both single-hop selection and short multi-hop reasoning. S3Mem generates compact evidence packages without requiring reader fine-tuning. Evaluated on LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem sharply reduces evidence token counts (e.g., to 1,073 tokens on LoCoMo, a 15.8× compression) while maintaining or improving F1 and BLEU scores.
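To put the LoCoMo figure in perspective, the short calculation below derives the evidence size that the reported ratio would imply before compression. It assumes the 15.8× figure is defined as baseline evidence tokens divided by S3Mem evidence tokens, which the summary does not state explicitly; the function name is hypothetical.

```python
def implied_baseline_tokens(compressed_tokens: int, compression_ratio: float) -> float:
    """Baseline evidence size implied by a compressed size and its ratio.

    Assumes ratio = baseline_tokens / compressed_tokens (an assumption,
    not a definition taken from the paper).
    """
    return compressed_tokens * compression_ratio


baseline = implied_baseline_tokens(1073, 15.8)
print(round(baseline))  # roughly 17k tokens of evidence before compression
```

Under that assumption, the comparison system would be passing on the order of 17,000 evidence tokens to the reader for the same questions.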

📝 Abstract

Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld). Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.
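The write, retrieve, and expose pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneEventUnit` fields, the overlap-based `anchor_score`, the greedy packing strategy, and the whitespace token count are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class SceneEventUnit:
    """Hypothetical structured memory unit; field names are illustrative."""
    step: int
    location: str
    event: str
    entities: frozenset = field(default_factory=frozenset)

    def text(self) -> str:
        return f"[t={self.step} @ {self.location}] {self.event}"


def anchor_score(unit: SceneEventUnit, anchors: set) -> int:
    """Count how many query anchors (entities or places) the unit mentions."""
    return len(anchors & (unit.entities | {unit.location}))


def build_evidence(units, anchors, token_budget):
    """Greedily pack the best-matching units into a fixed token budget."""
    ranked = sorted(units, key=lambda u: (-anchor_score(u, anchors), u.step))
    picked, used = [], 0
    for unit in ranked:
        if anchor_score(unit, anchors) == 0:
            break  # remaining units share no anchor with the query
        cost = len(unit.text().split())  # whitespace tokens as a crude proxy
        if used + cost <= token_budget:
            picked.append(unit)
            used += cost
    # Return evidence in chronological order so temporal chains stay readable.
    return sorted(picked, key=lambda u: u.step), used


# Toy trajectory, invented for illustration only.
history = [
    SceneEventUnit(3, "forest", "collected wood", frozenset({"wood"})),
    SceneEventUnit(7, "cave", "mined stone", frozenset({"stone"})),
    SceneEventUnit(12, "forest", "crafted table from wood", frozenset({"wood", "table"})),
]
evidence, used = build_evidence(history, {"wood", "forest"}, token_budget=12)
for unit in evidence:
    print(unit.text())
```

Here the cave event is dropped because it shares no anchor with the query, while both forest events fit within the budget and are returned in temporal order. A real system would need a learned or embedding-based scorer and a proper tokenizer; the sketch only shows how anchor matching and a token budget interact.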

Problem

Research questions and friction points this paper is trying to address.

long-horizon question answering

memory interface

structured memory

evidence selection

spatiotemporal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured memory

spatiotemporal reasoning

evidence routing