🤖 AI Summary
This work proposes DeepRead, a structure-aware, multi-turn document reasoning agent. Existing retrieval methods treat long documents as flat text and ignore their hierarchical organization and discourse order, which limits performance on complex question answering. DeepRead is the first to explicitly integrate a document’s native hierarchical structure into the retrieval process: it uses an LLM-driven OCR pipeline to produce structured Markdown, builds a paragraph-level coordinate index, and introduces two complementary tools, structure-aware retrieval (Retrieve) and sequential, section-preserving reading (ReadSection), to emulate human-like “locate-then-read” reasoning. On long-document QA tasks, DeepRead outperforms Search-o1-style agentic search baselines by an average of 10.3% across four benchmarks, and behavioral analysis confirms the effectiveness of its tool coordination and structure-guided reasoning.
📝 Abstract
With the rapid advancement of tool-use capabilities in Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) is shifting from static, one-shot retrieval toward autonomous, multi-turn evidence acquisition. However, existing agentic search frameworks typically treat long documents as flat collections of unstructured chunks, disregarding the native hierarchical organization and sequential logic essential for human comprehension. To bridge this gap, we introduce DeepRead, a structure-aware document reasoning agent designed to operationalize document-native structural priors into actionable reasoning capabilities. Leveraging the structural fidelity of modern OCR, DeepRead constructs a paragraph-level, coordinate-based navigation system and equips the LLM with two synergistic tools: Retrieve for scanning-aware localization, and ReadSection for contiguous, order-preserving reading within specific hierarchical scopes. This design elicits a human-like “locate-then-read” reasoning paradigm, effectively mitigating the context fragmentation inherent in traditional retrieval methods. Extensive evaluations across four benchmarks spanning diverse document types demonstrate that DeepRead outperforms Search-o1-style agentic search baselines by an average of 10.3%. Fine-grained behavioral analysis further confirms that DeepRead autonomously adopts human-aligned reading strategies, validating the critical role of structural awareness in achieving precise document reasoning. Our code is available at https://github.com/Zhanli-Li/DeepRead.
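The locate-then-read idea described above can be illustrated with a small sketch. This is not the authors' implementation: the names `Paragraph`, `build_index`, `retrieve`, and `read_section` are hypothetical, the coordinate scheme (section path plus paragraph position) is an assumed reading of "paragraph-level, coordinate-based navigation", and simple word overlap stands in for whatever scoring the real Retrieve tool uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    section: tuple[int, ...]  # hypothetical coordinate: path of heading counters, e.g. (1, 2)
    index: int                # position of the paragraph inside its section
    heading: str              # title of the enclosing section
    text: str


def build_index(markdown: str) -> list[Paragraph]:
    """Tag every paragraph of a Markdown document with a (section, index) coordinate."""
    counters: list[int] = []
    heading = ""
    paragraphs: list[Paragraph] = []
    buffer: list[str] = []
    para_idx = 0

    def flush() -> None:
        nonlocal para_idx
        if buffer:
            paragraphs.append(Paragraph(tuple(counters), para_idx, heading, " ".join(buffer)))
            para_idx += 1
            buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop counters of deeper levels, pad if a level was skipped, then advance.
            counters = counters[:level] + [0] * (level - len(counters))
            counters[level - 1] += 1
            heading = match.group(2).strip()
            para_idx = 0
        elif line.strip():
            buffer.append(line.strip())
        else:
            flush()
    flush()
    return paragraphs


def retrieve(index: list[Paragraph], query: str, k: int = 2) -> list[Paragraph]:
    """Locate step: rank paragraphs by word overlap (an assumed stand-in scorer)."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for p in index:
        score = len(terms & set(re.findall(r"\w+", p.text.lower())))
        if score:
            scored.append((score, p))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps document order on ties
    return [p for _, p in scored[:k]]


def read_section(index: list[Paragraph], section: tuple[int, ...]) -> list[Paragraph]:
    """Read step: return the whole section (and its subsections) in document order."""
    return [p for p in index if p.section[: len(section)] == section]


doc = """# Report
Intro paragraph.

## Methods
We sample 100 users.

The survey ran for two weeks.

## Results
Response rate was 40 percent.
"""

index = build_index(doc)
hit = retrieve(index, "survey weeks", k=1)[0]   # locate a relevant paragraph
context = read_section(index, hit.section)      # then read its section contiguously
for p in context:
    print(p.section, p.index, p.text)
```

In this sketch the retrieved hit carries its section coordinate, so the follow-up read returns every paragraph of that section in order rather than an isolated chunk, which is the context-fragmentation problem the abstract refers to.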