Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

📅 2025-07-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper proposes DocsRay, a training-free framework for zero-shot understanding of long, structurally inconsistent multimodal documents (text, images, charts, tables). DocsRay makes two main contributions: (i) it prompts a multimodal large language model to generate a pseudo table of contents (pseudo-TOC), which gives heterogeneous content a unified textual representation; and (ii) it uses a two-stage hierarchical retrieval-augmented generation (RAG) architecture that reduces retrieval complexity from O(N) to O(S + k₁·Nₛ). On long documents averaging 49.4 pages and 20,971 textual tokens, DocsRay cuts query latency by 45% (from 3.89 s to 2.12 s), and DocsRay-Pro reaches 64.7% accuracy on MMLongBench-Doc, which the authors report as substantially above previous state-of-the-art results.

📝 Abstract

Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models' (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
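The two headline numbers in the abstract can be checked with a short worked calculation. The latency figures come from the abstract itself. The reading of the complexity symbols is an assumption, since the abstract does not define them: $N$ is taken as the total number of chunks, $S$ as the number of pseudo-TOC sections, $N_s$ as the average number of chunks per section, and $k_1$ as the number of sections kept after the first stage. The values $N = 500$, $S = 25$, $k_1 = 3$ are hypothetical and chosen only for illustration.

```latex
% Latency reduction reported in the abstract
\frac{3.89 - 2.12}{3.89} = \frac{1.77}{3.89} \approx 0.455 \quad (\text{about } 45\%)

% Illustrative comparison count (hypothetical values, not from the paper)
N = 500, \quad S = 25, \quad N_s = \frac{N}{S} = 20, \quad k_1 = 3

\text{flat search: } N = 500
\qquad
\text{two-stage search: } S + k_1 \cdot N_s = 25 + 3 \cdot 20 = 85
```

Under these assumed values the two-stage search scores 85 candidates instead of 500. The saving shrinks as $k_1$ grows and disappears when $k_1 \cdot N_s$ approaches $N$.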

Problem

Research questions and friction points this paper is trying to address.

Understanding complex multimodal documents with structural inconsistencies

Processing diverse document elements without specialized models or training

Reducing retrieval complexity and improving efficiency in document analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo TOC generation for document structuring

Zero-shot multimodal analysis with LLMs

Two-stage hierarchical retrieval system
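The two-stage retrieval idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DocsRay implementation: the names `Section`, `Chunk`, `two_stage_retrieve`, `k1`, and `k2` are hypothetical, the vectors are hand-written toy embeddings, and cosine similarity is an assumed scoring function. Stage one scores every section summary, and stage two scores only the chunks inside the top `k1` sections.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vec: list[float]  # toy embedding (hypothetical values)


@dataclass
class Section:
    title: str
    vec: list[float]  # embedding of the section summary (hypothetical)
    chunks: list[Chunk] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_retrieve(query, sections, k1=1, k2=2):
    """Return (top chunks, number of similarity comparisons performed)."""
    comparisons = 0

    # Stage 1: score all S section summaries and keep the best k1.
    scored_sections = []
    for section in sections:
        scored_sections.append((cosine(query, section.vec), section))
        comparisons += 1
    scored_sections.sort(key=lambda pair: pair[0], reverse=True)
    top_sections = [section for _, section in scored_sections[:k1]]

    # Stage 2: score only the chunks inside the selected sections.
    scored_chunks = []
    for section in top_sections:
        for chunk in section.chunks:
            scored_chunks.append((cosine(query, chunk.vec), chunk))
            comparisons += 1
    scored_chunks.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored_chunks[:k2]], comparisons


# Toy pseudo-TOC: 3 sections with 2 chunks each (6 chunks in total).
SECTIONS = [
    Section("Methods", [1.0, 0.0], [
        Chunk("retrieval design", [0.9, 0.1]),
        Chunk("index layout", [0.8, 0.3]),
    ]),
    Section("Results", [0.0, 1.0], [
        Chunk("latency table", [0.1, 0.9]),
        Chunk("accuracy table", [0.2, 0.8]),
    ]),
    Section("Appendix", [0.7, 0.7], [
        Chunk("prompt templates", [0.6, 0.7]),
        Chunk("dataset notes", [0.7, 0.6]),
    ]),
]

if __name__ == "__main__":
    hits, n_comparisons = two_stage_retrieve([0.1, 1.0], SECTIONS, k1=1, k2=1)
    print([chunk.text for chunk in hits], n_comparisons)
```

With `k1=1`, the toy query is compared against 3 section summaries and then 2 chunks, for 5 comparisons rather than the 6 a flat scan over all chunks would need. On a corpus this small the gap is trivial; the point is only that the count follows the S + k₁·Nₛ pattern.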
