🤖 AI Summary
This work addresses multimodal document question answering (MDQA) over collections of visually rich documents containing tables, charts, and presentation slides, where existing methods struggle to integrate evidence across modalities and to attribute answers to their supporting context. The authors propose VisDoMRAG, a retrieval-augmented generation framework that runs parallel visual and textual RAG pipelines, each performing evidence curation and chain-of-thought (CoT) reasoning, and merges their outputs through a consistency-constrained modality fusion mechanism that aligns the two reasoning chains at inference time, improving answer accuracy and verifiability via implicit context attribution. To evaluate MDQA systems, the authors also introduce VisDoMBench, the first comprehensive benchmark covering multi-document settings with diverse multimodal content and challenging reasoning tasks. Experiments with open-source and proprietary LLMs show that VisDoMRAG outperforms unimodal baselines and long-context LLMs on end-to-end multimodal document QA by 12–20%, advancing grounded, verifiable document understanding.
📝 Abstract
Understanding information from a collection of documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities, and to improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. The results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12–20%.
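To make the dual-pipeline design concrete, the sketch below is a minimal, hypothetical illustration of the idea described in the abstract, not the authors' implementation: a textual and a visual RAG pipeline each return curated evidence, a chain-of-thought, and a candidate answer, and a final LLM call reconciles the two reasoning chains into one evidence-consistent answer. All names here (`ModalityResult`, `fuse_with_consistency`, and the injected `run_text_rag` / `run_visual_rag` / `llm` callables) are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModalityResult:
    """Output of one RAG pipeline (textual or visual) for a single question."""
    evidence: List[str]   # curated evidence items (text snippets or page-image references)
    reasoning: str        # chain-of-thought produced for this modality
    answer: str           # candidate answer from this modality alone


def fuse_with_consistency(question: str,
                          text_result: ModalityResult,
                          visual_result: ModalityResult,
                          llm: Callable[[str], str]) -> str:
    # Inference-time fusion: prompt the LLM to compare both reasoning chains,
    # resolve disagreements in favor of steps grounded in the cited evidence,
    # and emit a single final answer.
    prompt = (
        f"Question: {question}\n\n"
        f"[Textual evidence]\n{text_result.evidence}\n"
        f"[Textual reasoning]\n{text_result.reasoning}\n"
        f"[Textual answer] {text_result.answer}\n\n"
        f"[Visual evidence]\n{visual_result.evidence}\n"
        f"[Visual reasoning]\n{visual_result.reasoning}\n"
        f"[Visual answer] {visual_result.answer}\n\n"
        "Check the two reasoning chains against each other, resolve any "
        "disagreement by preferring steps supported by the cited evidence, "
        "and give one final answer."
    )
    return llm(prompt)


def answer_question(question: str,
                    run_text_rag: Callable[[str], ModalityResult],
                    run_visual_rag: Callable[[str], ModalityResult],
                    llm: Callable[[str], str]) -> str:
    # Run both RAG pipelines independently, then fuse their outputs.
    return fuse_with_consistency(question,
                                 run_text_rag(question),
                                 run_visual_rag(question),
                                 llm)
```

In this framing, the fusion prompt is what enforces consistency: the model is asked to keep only conclusions both chains can support with cited evidence, which is also the source of the implicit context attribution mentioned in the abstract.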