Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
DocVQA poses dual challenges: long-document multimodal understanding and cross-modal reasoning. Existing DocRAG methods over-rely on textual content, neglect visual cues, and lack reliable benchmarks for multimodal evidence selection and integration. To address this, we propose MMDocRAG, the first retrieval-augmented QA benchmark tailored to multi-page, heterogeneous documents with interleaved text, figures, and tables, comprising 4,055 expert-annotated QA pairs with fine-grained multimodal evidence chains. We introduce a joint image-text citation evaluation metric and a multimodal evidence-chain annotation paradigm, and design a fine-grained visual-description enhancement mechanism that markedly improves open-source LLM performance on DocVQA. Large-scale evaluation across 60 VLMs/LLMs and 14 retrievers reveals systematic deficiencies in multimodal evidence retrieval and integration, and shows that high-fidelity visual descriptions are critical for performance gains, establishing MMDocRAG as a rigorous benchmark for this domain.
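The joint image-text citation metric is named but not specified on this page. Below is a minimal, hypothetical Python sketch assuming it reduces to a set-level F1 over the quote IDs an answer cites, scoring text and image evidence jointly; the `Quote` class and `citation_f1` function are illustrative names, not the benchmark's actual API.

```python
# Hypothetical sketch of a joint image-text citation metric: F1 over the set
# of quotes an answer cites, so neither modality can be ignored. This is an
# assumed formulation, not the paper's actual metric.
from dataclasses import dataclass

@dataclass(frozen=True)
class Quote:
    quote_id: str  # identifier of a text passage or image element
    modality: str  # "text" or "image"

def citation_f1(predicted: set[Quote], gold: set[Quote]) -> float:
    """Set-level F1 between cited quotes and the gold evidence chain."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this reading, an answer that cites only text passages while the gold chain also requires a figure is penalized on recall, which matches the paper's emphasis on not neglecting visual evidence.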

📝 Abstract
Document Visual Question Answering (DocVQA) faces dual challenges: processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLMs/LLMs and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs outperform open-source alternatives and gain moderately from multimodal inputs over text-only inputs, whereas open-source alternatives degrade significantly with multimodal inputs. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.
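The abstract notes that answers interleave text with visual elements. As a hedged illustration only, one plausible representation is a list of typed segments in which image segments reference quote IDs from the source document; the field names below are assumptions, not the dataset's actual schema.

```python
# Illustrative (assumed) structure for an interleaved answer: text segments
# mixed with references to visual quotes drawn from the source document.
interleaved_answer = [
    {"type": "text", "content": "Revenue grew 12% year over year,"},
    {"type": "image", "quote_id": "fig-3"},  # cited figure evidence
    {"type": "text", "content": "as shown in the cited chart."},
]

# Collect the visual citations for downstream scoring.
cited_images = [seg["quote_id"] for seg in interleaved_answer
                if seg["type"] == "image"]
```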
Problem

Research questions and friction points this paper is trying to address.

Processing lengthy multimodal documents (text, images, tables) for DocVQA
Text-centric DocRAG approaches that frequently miss critical visual information
Lack of robust benchmarks for multimodal evidence selection and integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the MMDocRAG benchmark with 4,055 expert-annotated QA pairs and cross-modal evidence chains
Develops metrics for evaluating multimodal quote selection
Evaluates 60 VLMs/LLMs and 14 retrieval systems (see the recall@k sketch below)
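The exact retrieval protocol for the 14 systems is not reproduced on this page. The sketch below shows, under stated assumptions, how per-retriever recall@k against gold multimodal evidence chains could be averaged over the benchmark; the `retriever.retrieve(question, k)` interface and the `dataset` record fields are hypothetical.

```python
# Assumed evaluation loop: each retriever returns ranked quote IDs for a
# question, scored by recall@k against the gold evidence chain.
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold evidence recovered in the top-k results."""
    if not gold_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)

def evaluate_retriever(retriever, dataset, k: int = 10) -> float:
    """Mean recall@k over QA pairs; `retriever.retrieve` and the record
    fields ("question", "gold_evidence_ids") are assumed interfaces."""
    scores = [
        recall_at_k(retriever.retrieve(ex["question"], k),
                    set(ex["gold_evidence_ids"]), k)
        for ex in dataset
    ]
    return sum(scores) / len(scores) if scores else 0.0
```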
Kuicai Dong
Huawei Noah's Ark Lab, Nanyang Technological University
Natural Language Processing · Information Extraction · Information Retrieval · RAG · Recommendation
Yujing Chang
Huawei Noah's Ark Lab
Shijie Huang
Huawei Noah's Ark Lab
Yasheng Wang
Tencent
Natural Language Processing
Ruiming Tang
Huawei Noah's Ark Lab
Yong Liu
Huawei Noah's Ark Lab