MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

📅 2026-06-02
🤖 AI Summary
This work addresses a limitation of existing multimodal Retrieval-Augmented Generation (RAG) approaches: they often disregard the complex layout structures of enterprise documents, which leads to inaccurate content understanding. The authors propose a structure-aware multimodal RAG framework built on a direction-adaptive document parsing strategy that explicitly models layout for vertically structured documents (e.g., reports) while preserving holistic page-level semantics for horizontally structured ones (e.g., slide decks). The framework combines this with a unified LLM-driven artifact transformation and an inference-time multimodal assembly mechanism. Structure-aware splitting and dynamic routing let it maintain natural reading order and improve answer faithfulness without model fine-tuning. The authors also introduce FastRAGEval, a single-call LLM-judge metric for fine-grained generative recall that halves RAGChecker's cost. Experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V) show gains of up to 32 percentage points over state-of-the-art vision-centric methods, with particularly strong results on report-style documents.
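The dynamic routing idea can be illustrated with a minimal sketch. The summary does not say how orientation is detected, so the page aspect-ratio majority vote below is purely an assumption, and the names `Page`, `document_orientation`, and `route` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical page record; only the dimensions matter here."""
    width: float
    height: float


def document_orientation(pages: List[Page]) -> str:
    """Assumed heuristic: majority vote on page aspect ratio.

    Returns 'horizontal' when more than half of the pages are wider
    than they are tall (slide-like), otherwise 'vertical' (report-like).
    """
    if not pages:
        raise ValueError("document has no pages")
    landscape = sum(1 for p in pages if p.width > p.height)
    return "horizontal" if landscape * 2 > len(pages) else "vertical"


def route(pages: List[Page]) -> str:
    """Pick an ingestion pipeline name based on the detected orientation."""
    if document_orientation(pages) == "vertical":
        return "layout_aware_parsing"       # e.g. reports
    return "page_level_representation"      # e.g. slide decks
```

A real system would likely need a stronger signal than page dimensions, since a landscape page can still hold report-style content.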
📝 Abstract
Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any fine-tuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.
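One way to picture placeholder-based positional alignment and inference-time assembly is the sketch below. It assumes that indexed chunk text carries inline markers where figures or tables originally appeared, and that those markers are swapped for the stored artifacts only when the generation context is built. The `[[ARTIFACT:id]]` marker syntax, the `assemble_context` function, and the file paths are all hypothetical; the abstract does not specify the actual format.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical marker syntax; the paper's real placeholder format is not given.
PLACEHOLDER = re.compile(r"\[\[ARTIFACT:([A-Za-z0-9_-]+)\]\]")


def assemble_context(chunk_text: str,
                     artifacts: Dict[str, str]) -> List[Tuple[str, str]]:
    """Expand placeholders into an ordered list of (kind, payload) parts.

    Text segments and artifacts are emitted in the order they appear in
    the chunk, so the reading order of the source page is preserved.
    Placeholders with no stored artifact are skipped.
    """
    parts: List[Tuple[str, str]] = []
    cursor = 0
    for match in PLACEHOLDER.finditer(chunk_text):
        before = chunk_text[cursor:match.start()].strip()
        if before:
            parts.append(("text", before))
        artifact_id = match.group(1)
        if artifact_id in artifacts:
            parts.append(("image", artifacts[artifact_id]))
        cursor = match.end()
    tail = chunk_text[cursor:].strip()
    if tail:
        parts.append(("text", tail))
    return parts


chunk = "Revenue grew. [[ARTIFACT:fig1]] See the table. [[ARTIFACT:tab2]]"
context = assemble_context(chunk, {"fig1": "fig1.png"})
```

Under this assumption the retriever can embed the placeholder-bearing text while the generator receives interleaved text and images, which is one plausible reading of decoupling retrieval representations from generation context.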
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval-augmented generation
enterprise document understanding
document structure
layout-aware parsing
structured information extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal RAG
document structure awareness
layout-aware parsing
position-aligned artifact transformation
inference-time multimodal assembly