🤖 AI Summary
To address the challenge of zero-shot understanding of long, structurally inconsistent multimodal documents (text, images, charts, tables), this paper proposes DocsRay, a training-free document understanding framework. DocsRay introduces two key innovations: (i) prompt-driven use of multimodal large language models to generate a pseudo table of contents (pseudo-TOC), which gives heterogeneous content a unified textual representation; and (ii) a two-stage hierarchical retrieval-augmented generation (RAG) architecture that reduces retrieval complexity from O(N) to O(S + k₁·Nₛ). Evaluated on long documents averaging 49.4 pages and 20,971 text tokens, DocsRay cuts query latency by 45% (from 3.89 s to 2.12 s), and DocsRay-Pro reaches 64.7% accuracy on MMLongBench-Doc, substantially surpassing previous state-of-the-art results.
📝 Abstract
Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce \textit{DocsRay}, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models' (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
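The two-stage retrieval idea can be illustrated with a minimal sketch. This is not DocsRay's implementation. The abstract does not define the symbols in $O(S + k_1 \cdot N_s)$, so the sketch assumes $S$ is the number of pseudo-TOC sections, $k_1$ is the number of sections kept after the first stage, and $N_s$ is the number of chunks per section. The names `Section`, `Chunk`, and `hierarchical_retrieve`, the parameter `k2`, the toy 2-D embeddings, and the use of cosine similarity are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Chunk:  # hypothetical: one retrievable passage
    text: str
    embedding: List[float]


@dataclass
class Section:  # hypothetical: one pseudo-TOC entry with its chunks
    title: str
    embedding: List[float]
    chunks: List[Chunk] = field(default_factory=list)


def hierarchical_retrieve(
    query: List[float], sections: List[Section], k1: int = 2, k2: int = 3
) -> List[Tuple[float, str, str]]:
    """Return up to k2 (score, section title, chunk text) tuples."""
    # Stage 1: score every section once (S comparisons), keep the top k1.
    top_sections = sorted(
        sections, key=lambda s: cosine(query, s.embedding), reverse=True
    )[:k1]
    # Stage 2: score only chunks inside the kept sections
    # (roughly k1 * N_s comparisons instead of all N chunks).
    candidates = [
        (cosine(query, c.embedding), s.title, c.text)
        for s in top_sections
        for c in s.chunks
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return candidates[:k2]


# Toy data: hand-written 2-D vectors stand in for real embeddings.
sections = [
    Section("Methods", [1.0, 0.0], [
        Chunk("retrieval details", [0.9, 0.1]),
        Chunk("toc generation", [0.8, 0.3]),
    ]),
    Section("Results", [0.0, 1.0], [Chunk("latency table", [0.1, 0.9])]),
    Section("Appendix", [-1.0, 0.0], [Chunk("prompts", [-0.9, 0.1])]),
]
hits = hierarchical_retrieve([1.0, 0.1], sections, k1=1, k2=1)
print(hits[0][1], "->", hits[0][2])
```

Under these assumptions, a flat search scores all $N$ chunks per query, while the two-stage search scores $S$ section embeddings and then only the chunks in the $k_1$ selected sections. The saving grows when sections are many and each is small relative to the whole document.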