🤖 AI Summary
Existing structured retrieval-augmented generation (RAG) approaches to long-document question answering face three problems: knowledge organization is costly because it is performed regardless of the query, the original document structure is underused, and prior reasoning experience is not reused. This work proposes DocTrace, a multi-agent RAG framework that combines query-triggered knowledge organization, document-structure-aware reasoning, and experience-guided reasoning reuse. Specifically, DocTrace preserves hierarchical information with a lightweight document structure tree index, constructs a shared hypergraph-based working memory on demand, and stores successful reasoning plans in a graph-structured experience memory for future reuse. Evaluated on four long-document QA benchmarks, DocTrace achieves the best performance on three of them, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in exact match (EM), while reducing overall computational cost by 53.32%.
📝 Abstract
Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization, document-structure-aware reasoning, and experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structure tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%.
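To make the three memory structures named above more concrete, the following is a minimal illustrative sketch of what a structure tree index, a hypergraph working memory, and a graph-structured experience memory could look like. It is not the DocTrace implementation: the class names (`SectionNode`, `WorkingMemory`, `ExperienceMemory`), their methods, and the naive keyword-overlap recall are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SectionNode:
    """Hypothetical node of a document structure tree (chapter, section, ...)."""
    title: str
    text: str = ""
    children: list[SectionNode] = field(default_factory=list)

    def find(self, keyword: str, path: tuple = ()) -> list:
        """Return the title paths of all sections whose text mentions keyword."""
        here = path + (self.title,)
        hits = [here] if keyword.lower() in self.text.lower() else []
        for child in self.children:
            hits.extend(child.find(keyword, here))
        return hits


@dataclass
class WorkingMemory:
    """Hypothetical hypergraph: each hyperedge links any number of evidence items."""
    hyperedges: list = field(default_factory=list)

    def link(self, relation: str, *evidence: str) -> None:
        # Built on demand: an edge is added only when reasoning needs it.
        self.hyperedges.append((relation, frozenset(evidence)))

    def neighbors(self, item: str) -> set:
        """All evidence items that share at least one hyperedge with item."""
        out = set()
        for _, members in self.hyperedges:
            if item in members:
                out |= members - {item}
        return out


@dataclass
class ExperienceMemory:
    """Hypothetical graph from question tokens to plans that succeeded before."""
    edges: dict = field(default_factory=dict)

    def store(self, question: str, plan: list) -> None:
        for token in set(question.lower().split()):
            self.edges.setdefault(token, []).append(plan)

    def recall(self, question: str) -> list:
        """Plans stored under any token of the new question (naive overlap)."""
        seen, plans = set(), []
        for token in set(question.lower().split()):
            for plan in self.edges.get(token, []):
                if id(plan) not in seen:  # the same plan may sit under many tokens
                    seen.add(id(plan))
                    plans.append(plan)
        return plans
```

In this sketch, a question about a character would first call `find` on the tree to locate candidate sections together with their position in the hierarchy, then `link` the extracted evidence in the working memory, and finally `store` the plan if the answer was judged correct, so that `recall` can surface it for a related question. A real system would presumably replace the keyword matching with learned retrieval; that choice is an assumption of this example, not something the abstract specifies.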