MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses multimodal long-document question answering, where accumulating interaction tends to entangle context, dilute critical evidence, and add noise to multi-hop reasoning. To mitigate these issues, the authors propose MARDoc, a multi-agent framework built on structured memory that splits the task across three specialized agents: an Explorer that performs multi-granularity multimodal retrieval, a Refiner that distills interaction traces into structured evidence and reasoning memory, and a Reflector that checks evidence sufficiency and drives iterative refinement. Decoupling retrieval, refinement, and reflection is intended to limit context bloat and noise while preserving key facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show MARDoc outperforming same-backbone baselines, which supports the use of structured memory in multimodal long-document QA.
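To make "key facts and their logical dependencies" concrete, here is a minimal sketch of what a structured memory could look like. The schema is an assumption for illustration only: the class names `EvidenceItem`, `ReasoningStep`, and `StructuredMemory`, and all of their fields, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceItem:
    """One distilled fact plus where it was found (hypothetical schema)."""
    evidence_id: str
    page: int
    modality: str  # e.g. "text", "table", "figure"
    fact: str


@dataclass
class ReasoningStep:
    """An intermediate conclusion and the evidence ids it depends on."""
    claim: str
    depends_on: list[str]


@dataclass
class StructuredMemory:
    """Compact store passed between iterations instead of the full history."""
    evidence: dict[str, EvidenceItem] = field(default_factory=dict)
    reasoning: list[ReasoningStep] = field(default_factory=list)

    def add_evidence(self, item: EvidenceItem) -> None:
        # Keyed by id, so re-retrieving the same fact overwrites the entry
        # instead of growing the memory.
        self.evidence[item.evidence_id] = item

    def add_step(self, step: ReasoningStep) -> None:
        # Reject steps that cite evidence the memory does not hold, so every
        # recorded dependency stays resolvable.
        missing = [e for e in step.depends_on if e not in self.evidence]
        if missing:
            raise KeyError(f"unknown evidence ids: {missing}")
        self.reasoning.append(step)

    def render(self) -> str:
        # Serialize to a short prompt-ready string.
        lines = ["EVIDENCE"]
        for item in self.evidence.values():
            lines.append(
                f"[{item.evidence_id}] p.{item.page} ({item.modality}): {item.fact}"
            )
        lines.append("REASONING")
        for step in self.reasoning:
            lines.append(f"{step.claim} <- {', '.join(step.depends_on)}")
        return "\n".join(lines)
```

In this sketch the memory size tracks the number of distinct facts rather than the number of retrieval calls, which is one plausible way to read the context-bloat argument above.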

📝 Abstract

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.
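The Explorer, Refiner, Reflector cycle described in the abstract can be sketched as a short control loop. This is an illustration under stated assumptions, not the authors' implementation: the function `run_refinement_loop`, the dictionary-based memory, the `max_iters` cap, and the three toy agents (plain functions standing in for model calls) are all hypothetical.

```python
# Toy two-page "document"; a real system would hold text, tables, and figures.
DOC = {1: "Revenue in 2022 was 10M.", 2: "Revenue in 2023 was 14M."}


def toy_explorer(question, memory, feedback):
    # Retrieves one page per call: the page named in the feedback, else page 1.
    page = int(feedback) if feedback else 1
    return [f"p{page}: {DOC[page]}"]


def toy_refiner(question, memory, observations):
    # Distills raw observations into memory, dropping duplicates.
    new = [o for o in observations if o not in memory["evidence"]]
    evidence = memory["evidence"] + new
    return {"evidence": evidence, "reasoning": [f"{len(evidence)} fact(s) collected"]}


def toy_reflector(question, memory):
    # Declares evidence sufficient once every page is covered; otherwise
    # returns targeted feedback naming the first missing page.
    have = {e.split(":")[0] for e in memory["evidence"]}
    for page in DOC:
        if f"p{page}" not in have:
            return False, str(page)
    return True, ""


def run_refinement_loop(question, explorer, refiner, reflector, max_iters=3):
    """Iterate explore -> refine -> reflect, carrying only structured memory."""
    memory = {"evidence": [], "reasoning": []}
    feedback = ""
    for iteration in range(1, max_iters + 1):
        # Each agent sees the compact memory, never the raw interaction trace.
        observations = explorer(question, memory, feedback)
        memory = refiner(question, memory, observations)
        sufficient, feedback = reflector(question, memory)
        if sufficient:
            return memory, iteration
    return memory, max_iters


if __name__ == "__main__":
    final_memory, iterations = run_refinement_loop(
        "How much did revenue grow from 2022 to 2023?",
        toy_explorer, toy_refiner, toy_reflector,
    )
    print(iterations, final_memory["evidence"])
```

Note that the raw `observations` are discarded after each refinement step, so the only state that crosses iterations is the memory and the Reflector's feedback.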

Problem

Research questions and friction points this paper is trying to address.

multimodal long-document QA

iterative retrieval-reasoning

context noise

multi-hop reasoning

evidence dilution

Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-aware refinement

structured memory

multimodal QA