🤖 AI Summary
Existing benchmarks do not jointly evaluate an AI system's memory of multi-session dialogue and its deep reading comprehension over long documents. This work proposes the first unified evaluation of both, introducing MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 question-answer pairs that combines multi-persona interactions, event graphs spanning months, real long legal documents, and multi-session conversations. Notably, 75.1% of the questions are Hybrid: the model must use the conversation history to identify the relevant document and then extract the answer from within it. Dataset quality is characterised with an LLM-as-judge self-consistency analysis (median Cohen's κ = 0.634), and six baseline configurations are evaluated, including RAG and long-context LLMs. The best-performing system, RAG-Both, achieves only 0.358 overall F1 and 0.342 Hybrid F1, which shows that current systems struggle to combine conversational retrieval with long-document reasoning.
📝 Abstract
AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each, sourced from the Caselaw Access Project), multi-session conversations grounded in those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag, which marks questions that require a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.
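For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how a median Cohen's κ over micro-worlds could be computed. It assumes that each micro-world yields two aligned lists of judge verdicts, one per prompt variant; the function names, label values, and the two-run setup are illustrative assumptions, not the authors' released pipeline.

```python
from collections import Counter
from statistics import median
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two aligned label sequences (e.g. two judge runs)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both runs.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of the marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        # Both runs used a single identical label; kappa is undefined (0/0).
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def median_kappa(per_world_runs: Sequence[tuple]) -> float:
    """Median kappa over (labels_a, labels_b) pairs, one pair per micro-world."""
    return median(cohens_kappa(a, b) for a, b in per_world_runs)


# Hypothetical verdicts from two prompt variants on one micro-world's questions.
run_a = ["valid", "valid", "invalid", "valid"]
run_b = ["valid", "invalid", "invalid", "valid"]
example_kappa = cohens_kappa(run_a, run_b)
```

Here κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each run's label frequencies. Reporting the median across micro-worlds rather than a pooled κ keeps a few unusually easy or hard worlds from dominating the figure.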