Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

📅 2025-07-30
🤖 AI Summary
To address zero-shot understanding of long, structurally inconsistent multimodal documents (text, images, charts, tables), this paper proposes DocsRay, a training-free document understanding framework. DocsRay rests on two key ideas: (i) prompting a multimodal large language model to generate a hierarchical pseudo table of contents (pseudo-TOC) and to convert heterogeneous content into a unified, text-centric representation; and (ii) a two-stage hierarchical retrieval-augmented generation (RAG) architecture that reduces retrieval complexity from O(N) to O(S + k₁·Nₛ). On documents averaging 49.4 pages and over 20K tokens, DocsRay cuts query latency by about 45% (from 3.89 s to 2.12 s), and the DocsRay-Pro variant reaches 64.7% accuracy on MMLongBench-Doc, which the paper reports as substantially above previous state-of-the-art results.

📝 Abstract
Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models' (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
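The two quantitative claims in the abstract can be checked with a short worked calculation. The first line uses only the reported latencies. The second line is purely illustrative: the abstract does not define its symbols, so it assumes that $S$ is the number of pseudo-TOC sections, $k_1$ the number of sections kept after the first retrieval stage, $N_s$ the number of chunks per section, and $N$ the total number of chunks. The document sizes used there are hypothetical and do not come from the paper.

```latex
% Latency reduction, from the reported 3.89 s and 2.12 s
\[
\frac{3.89 - 2.12}{3.89} = \frac{1.77}{3.89} \approx 0.455
\quad (\text{about } 45\%)
\]

% Hypothetical sizes (assumption, not from the paper):
% N = 1000 chunks, S = 50 sections, N_s = N / S = 20, k_1 = 3
\[
S + k_1 \cdot N_s = 50 + 3 \cdot 20 = 110 \ll N = 1000
\]
```

Under these assumed sizes, the two-stage search scores roughly a tenth as many items as a flat scan over all chunks.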
Problem

Research questions and friction points this paper is trying to address.

Understanding complex multimodal documents with structural inconsistencies
Processing diverse document elements without specialized models or training
Reducing retrieval complexity and improving efficiency in document analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo TOC generation for document structuring
Zero-shot multimodal analysis with LLMs
Two-stage hierarchical retrieval system
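The two-stage retrieval idea listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Section` class, the `two_stage_retrieve` function, the parameters `k1` and `k2`, the cosine scoring, and the toy two-dimensional vectors are all hypothetical. A real system would obtain the vectors from an embedding model and the sections from the generated pseudo-TOC.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Section:
    """One pseudo-TOC section (hypothetical structure for illustration)."""
    title: str
    embedding: list[float]                  # one vector summarizing the section
    chunks: list[tuple[str, list[float]]]   # (chunk text, chunk vector) pairs


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(query_vec, sections, k1=2, k2=3):
    """Return the top-k2 chunk texts and the number of similarity computations."""
    # Stage 1 (coarse): score every section once, i.e. S comparisons.
    ranked = sorted(sections, key=lambda s: cosine(query_vec, s.embedding),
                    reverse=True)
    selected = ranked[:k1]

    # Stage 2 (fine): score only the chunks inside the k1 selected sections.
    scored = [(cosine(query_vec, vec), text)
              for section in selected
              for text, vec in section.chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    comparisons = len(sections) + sum(len(s.chunks) for s in selected)
    return [text for _, text in scored[:k2]], comparisons


# Toy data: three sections with two chunks each (made-up vectors).
sections = [
    Section("Revenue", [1.0, 0.0],
            [("Q1 revenue table", [0.9, 0.1]),
             ("Revenue chart description", [0.8, 0.2])]),
    Section("Staffing", [0.0, 1.0],
            [("Headcount by team", [0.1, 0.9]),
             ("Hiring plan", [0.2, 0.8])]),
    Section("Risks", [0.7, 0.7],
            [("Market risk", [0.6, 0.6]),
             ("Currency risk", [0.7, 0.5])]),
]

results, comparisons = two_stage_retrieve([1.0, 0.1], sections, k1=1, k2=1)
print(results, comparisons)
```

With the toy data the saving is small (5 similarity computations instead of 6 for a flat scan), but the gap widens as the number of chunks per document grows while `k1` stays fixed.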
Hyeon Seong Jeong
Sogang University
Sangwoo Jo
Sogang University
Byeong Hyun Yoon
Sogang University
Yoonseok Heo
Sogang University
Natural Language Processing, Neural Machine Translation, Multimodal Neural Machine Translation, NLG
Haedong Jeong
Sogang University
Taehoon Kim
Sogang University