MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks do not jointly evaluate an AI system's multi-turn dialogue memory and its deep reasoning over long documents. This work introduces MemoryDocDataSet, which the authors present as the first benchmark to evaluate both together: a synthetic collection of 50 micro-worlds and 1,000 question-answer pairs that combine multi-persona interactions, event graphs spanning months, real long legal documents, and multi-session dialogues. Hybrid questions make up 75.1% of the dataset; they require a model to use dialogue history to identify the relevant document and then reason over that document to extract the answer. Dataset quality is assessed with an LLM-as-judge self-consistency analysis (median Cohen's κ = 0.634). Across multiple baselines, including RAG and long-context LLMs, the best-performing system, RAG-Both, reaches only 0.358 overall F1 and 0.342 F1 on Hybrid questions, which indicates a substantial gap in current systems' ability to combine conversational retrieval with long-document reasoning.
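For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Cohen's κ for two sequences of labels, together with a median over per-micro-world values. How the paper pairs its judge runs is not stated here, so the inputs (two label lists from two judging passes) and the function names `cohens_kappa` and `median_kappa` are assumptions made for illustration.

```python
from collections import Counter
from statistics import median
from typing import List, Sequence, Tuple


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both passes.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both passes used a single identical label everywhere; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def median_kappa(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Median kappa over groups, e.g. one (pass_1, pass_2) pair per micro-world (assumed)."""
    return median(cohens_kappa(a, b) for a, b in pairs)
```

Values near 0 mean agreement no better than chance, and 1 means perfect agreement; the median is less sensitive than the mean to a few outlying groups.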

📝 Abstract

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each, sourced from the Caselaw Access Project), multi-session conversations grounded in those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.
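The instance structure described in the abstract can be pictured with a small schema. This is an illustrative sketch only: the released dataset's actual file format is not described here, and every class name, field name, and tag spelling below (`MicroWorld`, `QAPair`, `source == "hybrid"`) is a hypothetical stand-in. Only the counts (3-5 personas, 3-5 documents, 20 QA pairs) come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QAPair:
    question: str
    answer: str
    category: str  # one of five reasoning categories (names are not given in the abstract)
    source: str    # hypothetical tag spelling, e.g. "hybrid" or "doc"


@dataclass
class MicroWorld:
    personas: List[str]
    events: List[Tuple[str, str]]  # hypothetical (date, description) entries of the event graph
    documents: List[str]           # long document texts
    sessions: List[List[str]]      # multi-session conversations, one list of turns per session
    qa_pairs: List[QAPair]

    def check(self) -> List[str]:
        """Return violations of the per-instance counts stated in the abstract."""
        problems = []
        if not 3 <= len(self.personas) <= 5:
            problems.append("expected 3-5 personas")
        if not 3 <= len(self.documents) <= 5:
            problems.append("expected 3-5 documents")
        if len(self.qa_pairs) != 20:
            problems.append("expected 20 QA pairs")
        return problems


def hybrid_share(worlds: List[MicroWorld]) -> float:
    """Fraction of all QA pairs whose source tag is the (assumed) value "hybrid"."""
    total = sum(len(w.qa_pairs) for w in worlds)
    hybrid = sum(1 for w in worlds for q in w.qa_pairs if q.source == "hybrid")
    return hybrid / total if total else 0.0
```

Computed over all 50 micro-worlds, a function like `hybrid_share` would be expected to reproduce the reported 75.1% Hybrid proportion, provided the tag values match those in the released data.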

Problem

Research questions and friction points this paper is trying to address.

conversational memory

long-document reasoning

joint retrieval

hybrid question answering

benchmark dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational memory

long-document reasoning

hybrid retrieval