🤖 AI Summary
Existing systems struggle to answer complex questions that require quantitative analysis and synthesis across large collections of semi-structured documents. This work formally defines the multi-document analytical question answering task and introduces MuDABench, a benchmark comprising over 80,000 document pages and 332 question-answer instances, annotated automatically via distant supervision from document-level metadata and annotated financial databases. Evaluation measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic of the reasoning process. Standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly on the benchmark. The authors propose a multi-agent workflow that orchestrates planning, information extraction, and code generation; it substantially outperforms standard RAG on both metrics but still falls well short of human experts, with single-document extraction accuracy and insufficient domain-specific knowledge identified as the main bottlenecks.
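To make the contrast with flat retrieval concrete, the following is a minimal sketch of a plan, extract, and aggregate pipeline. It is not the authors' implementation: the names `Document`, `plan`, `extract`, and `answer` are hypothetical, the extractor is a toy line parser standing in for an LLM-based extraction agent, and the aggregation is hand-written rather than produced by a code-generation agent.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical container for one semi-structured document."""
    doc_id: str
    metadata: Dict[str, object]  # e.g. {"company": "A", "year": 2022}
    text: str


def plan(corpus: List[Document], doc_filter: Dict[str, object]) -> List[Document]:
    """Planning step (assumed): select documents by metadata
    instead of ranking chunks from one flat retrieval pool."""
    return [
        d for d in corpus
        if all(d.metadata.get(k) == v for k, v in doc_filter.items())
    ]


def extract(doc: Document, field: str) -> Optional[float]:
    """Extraction step: toy stand-in for an LLM extractor.
    Reads the first 'field: value' line from the document text."""
    for line in doc.text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == field and value.strip():
            return float(value)
    return None


def answer(corpus: List[Document], doc_filter: Dict[str, object],
           field: str, op: Callable[[List[float]], float]) -> float:
    """Aggregation step: run one extraction per selected document,
    then combine the per-document facts with ordinary code."""
    facts = [extract(d, field) for d in plan(corpus, doc_filter)]
    values = [v for v in facts if v is not None]
    return op(values)


corpus = [
    Document("a-2022", {"company": "A", "year": 2022}, "revenue: 10.0\nprofit: 2.0"),
    Document("b-2022", {"company": "B", "year": 2022}, "revenue: 20.0\nprofit: 5.0"),
    Document("a-2021", {"company": "A", "year": 2021}, "revenue: 8.0\nprofit: 1.0"),
]

# Average 2022 revenue across all companies in the toy corpus.
avg_revenue_2022 = answer(corpus, {"year": 2022}, "revenue", mean)
```

The point of the design is that every selected document contributes exactly one extracted fact, so the final number depends on per-document extraction quality rather than on which chunks a retriever happened to rank highest.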
📝 Abstract
This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.
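The two-level evaluation described in the abstract can be illustrated with a small sketch. The matching rule here is an assumption: facts are keyed by hypothetical `(company, year)` tuples and compared with a 1% relative tolerance, and the function names `fact_coverage` and `answer_correct` are illustrative. The benchmark's actual scoring rules may differ.

```python
import math
from typing import Dict, Hashable


def fact_coverage(predicted: Dict[Hashable, float],
                  gold: Dict[Hashable, float],
                  rel_tol: float = 1e-2) -> float:
    """Fraction of gold intermediate facts that the system recovered.
    A fact counts as covered when the same key is present and the
    value matches within the (assumed) relative tolerance."""
    if not gold:
        return 0.0
    hits = sum(
        1 for key, value in gold.items()
        if key in predicted and math.isclose(predicted[key], value, rel_tol=rel_tol)
    )
    return hits / len(gold)


def answer_correct(predicted: float, gold: float, rel_tol: float = 1e-2) -> bool:
    """Final-answer check for a numeric answer (assumed tolerance)."""
    return math.isclose(predicted, gold, rel_tol=rel_tol)


# Toy example: three gold facts, one extracted correctly,
# one extracted with the wrong value, one missed entirely.
gold_facts = {("A", 2022): 10.0, ("B", 2022): 20.0, ("C", 2022): 30.0}
pred_facts = {("A", 2022): 10.0, ("B", 2022): 25.0}

coverage = fact_coverage(pred_facts, gold_facts)
final_ok = answer_correct(predicted=17.5, gold=20.0)
```

Reporting coverage alongside accuracy helps separate two failure modes: a wrong answer with high coverage points to the aggregation step, while a wrong answer with low coverage points to extraction.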