Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing long-context dialogue evaluation benchmarks, which focus primarily on explicit factual recall and fail to assess models’ reflective memory—the ability to synthesize high-level semantics from dispersed, multimodal cues. To bridge this gap, the study formally defines and systematically evaluates reflective memory in dialogue, introducing RefMem-Bench, a benchmark comprising 26K annotated samples across eight reflective dimensions and three task formats. Furthermore, the authors propose REMIND, a framework that integrates hierarchical meaning construction and high-level reasoning through question-guided retrieval, salience-aware grounding, abstraction-level supervision, and progressive reflective alignment. Experimental results demonstrate that RefMem-Bench poses a significant challenge to current models, while REMIND consistently improves both answer accuracy and memory recall.

📝 Abstract

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.
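
The abstract reports two quantities, answer accuracy and memory recall, without defining them here. As a rough illustration only, the sketch below shows one plausible way such scores could be computed over QA instances. The `Prediction` record, its field names, the exact-match rule, and the turn-level recall definition are all assumptions made for this example; they are not taken from RefMem-Bench or REMIND.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Prediction:
    # Hypothetical schema: the abstract does not specify any of these fields.
    predicted_answer: str
    gold_answer: str
    retrieved_turns: Set[int]  # dialogue-turn ids the model surfaced as evidence
    gold_turns: Set[int]       # annotated evidence turn ids for the question


def answer_accuracy(preds: List[Prediction]) -> float:
    """Exact-match accuracy after case/whitespace normalization (assumed metric)."""
    if not preds:
        return 0.0
    hits = sum(
        p.predicted_answer.strip().lower() == p.gold_answer.strip().lower()
        for p in preds
    )
    return hits / len(preds)


def memory_recall(preds: List[Prediction]) -> float:
    """Mean fraction of gold evidence turns that were retrieved (assumed definition)."""
    scored = [p for p in preds if p.gold_turns]  # skip items with no annotated evidence
    if not scored:
        return 0.0
    per_item = [len(p.retrieved_turns & p.gold_turns) / len(p.gold_turns) for p in scored]
    return sum(per_item) / len(per_item)


if __name__ == "__main__":
    demo = [
        Prediction("Anxious", "anxious", {1, 2}, {2, 3}),
        Prediction("calm", "tired", {4}, {4}),
    ]
    print(answer_accuracy(demo), memory_recall(demo))
```

Scoring the two quantities separately reflects the distinction the abstract draws: a model can retrieve the right dispersed evidence yet still fail to abstract the correct high-level answer from it, or the reverse.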

Problem

Research questions and friction points this paper is trying to address.

reflective memory

long-horizon dialogue

benchmark

multimodal cues

latent meaning

Innovation

Methods, ideas, or system contributions that make the work stand out.

reflective memory

long-horizon dialogue

REMIND