🤖 AI Summary
In long-document question answering, existing methods treat text as flat sequences or coarse-grained chunks, ignoring discourse structure and thereby introducing semantic misalignment. This work proposes DISRetrieval, a Rhetorical Structure Theory (RST)-based hierarchical retrieval framework that models documents as sentence-level tree-structured semantic graphs. The approach features: (1) discourse-aware document organization via explicit RST tree encoding; (2) LLM-driven adaptive node representation learning; and (3) a joint evidence retrieval mechanism integrating tree-structured encoding with hierarchical attention. To the authors' knowledge, this is the first framework to systematically incorporate linguistic discourse structure throughout the entire retrieval pipeline. Evaluated on the QASPER and QuALITY benchmarks, it achieves +12.1% token-level F1 and +8.3% downstream QA accuracy over strong baselines, demonstrating the value of discourse structure for long-text understanding.
📝 Abstract
Long document understanding has become increasingly crucial in natural language processing, with retrieval-based methods emerging as a promising solution to the context length limitations of large language models (LLMs). However, existing approaches either treat documents as flat sequences or employ arbitrary chunking strategies, failing to capture the inherent discourse structure that guides human comprehension. We present DISRetrieval, a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our approach introduces three key innovations: (1) a discourse-aware document organization framework that utilizes rhetorical structure theory (RST) to create sentence-level hierarchical representations, preserving both semantic relationships and natural document flow; (2) an LLM-enhanced node representation technique that combines discourse structure with adaptive summarization to enrich tree nodes with contextual information; and (3) a hierarchical evidence retrieval mechanism that effectively selects relevant content while maintaining discourse coherence. Through comprehensive experiments on the QASPER and QuALITY datasets, DISRetrieval demonstrates substantial improvements over existing methods in both token-level retrieval metrics and downstream question answering tasks. Our ablation studies confirm that incorporating discourse structure significantly enhances retrieval effectiveness across different document lengths and query types, validating the importance of linguistically informed document representation in long-text understanding. Our code and datasets are publicly available at github.com/DreamH1gh/DISRetrieval to facilitate future research.
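To make the hierarchical retrieval idea concrete, the sketch below shows a minimal top-down beam search over a discourse tree. It is a toy illustration, not the paper's implementation: the `Node` class, the bag-of-words `similarity` (a stand-in for the learned node embeddings the paper describes), and the example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from collections import Counter
import math

@dataclass
class Node:
    text: str                                   # node summary or leaf sentence
    children: list = field(default_factory=list)

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (toy stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(root: Node, query: str, beam: int = 2, k: int = 2) -> list:
    """Expand the beam of highest-scoring nodes level by level;
    collect reached leaves (sentences) and return the k best."""
    frontier, leaves = [root], []
    while frontier:
        frontier.sort(key=lambda n: similarity(n.text, query), reverse=True)
        next_frontier = []
        for node in frontier[:beam]:
            if node.children:
                next_frontier.extend(node.children)
            else:
                leaves.append(node)
        frontier = next_frontier
    leaves.sort(key=lambda n: similarity(n.text, query), reverse=True)
    return [n.text for n in leaves[:k]]

# Hypothetical discourse tree: internal nodes hold summaries, leaves hold sentences.
doc = Node("document about model training and evaluation", [
    Node("training setup", [
        Node("the model is trained with a learning rate of 1e-4"),
        Node("training uses batches of 32 documents"),
    ]),
    Node("evaluation results", [
        Node("evaluation reports token-level F1 on QASPER"),
        Node("accuracy is measured on QuALITY multiple choice questions"),
    ]),
])

print(retrieve(doc, "learning rate used for training", k=1))
```

Because scoring happens at every tree level, irrelevant subtrees are pruned early, and the leaves that survive arrive with their discourse context intact; the beam width trades recall against the amount of text passed to the reader model.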