Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking

📅 2025-07-14

🤖 AI Summary

Traditional RAG systems rely on fixed-size text chunking, which ignores document structure and so fragments meaning and lowers retrieval relevance. To address this, the paper proposes a RAG framework that combines hierarchical text segmentation with clustering. First, the text is split into structure-aware segments; then, semantically related segments are grouped into clusters, producing a dual-granularity vector index with both segment-level and cluster-level representations. During retrieval, the method uses both levels: fine-grained segment matching and coarser cluster-level matching, so that the retrieved units follow the document's structure and stay semantically coherent. Evaluated on three question answering datasets (NarrativeQA, QuALITY, and QASPER), the approach improves on traditional chunking baselines.

📝 Abstract

Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.

Problem

Research questions and friction points this paper is trying to address.

Improving semantic coherence in RAG text chunks

Enhancing retrieval precision with hierarchical segmentation

Addressing limitations of traditional chunking methods
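The fragmentation problem behind these points can be seen in a minimal sketch. The helper names `fixed_size_chunks` and `paragraph_chunks` and the sample text are illustrative assumptions, not taken from the paper, and splitting on blank lines is only a stand-in for the paper's hierarchical segmentation.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so each chunk is a whole paragraph.

    Assumption: blank-line splitting stands in for a real
    structure-aware segmenter.
    """
    return [p.strip() for p in text.split("\n\n") if p.strip()]


doc = (
    "RAG retrieves external passages for a language model.\n\n"
    "Chunk boundaries decide what the retriever can return."
)

fixed = fixed_size_chunks(doc, 40)  # windows may cut a sentence in half
paras = paragraph_chunks(doc)       # each chunk is a complete paragraph

for chunk in fixed:
    print(repr(chunk))
for chunk in paras:
    print(repr(chunk))
```

With a 40-character window the first chunk stops in the middle of a word, while the paragraph-based split keeps each sentence intact.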

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical text segmentation for semantic chunks

Segment-level and cluster-level vector retrieval

Improved RAG precision with structured clustering
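As a rough illustration of retrieval over both segment-level and cluster-level vectors, the sketch below scores each segment by mixing its own similarity to the query with the similarity of its cluster. Everything here is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, the cluster assignments are given by hand rather than computed, and the `alpha` weighting is a hypothetical way to combine the two levels; the summary above does not specify how the paper combines them.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(segments, clusters):
    """Return segment vectors and one summed vector per cluster."""
    seg_vecs = [embed(s) for s in segments]
    cluster_vecs = []
    for members in clusters:
        total = Counter()
        for i in members:
            total.update(seg_vecs[i])  # add member counts into the cluster vector
        cluster_vecs.append(total)
    return seg_vecs, cluster_vecs


def retrieve(query, segments, clusters, k=2, alpha=0.5):
    """Rank segments by a weighted mix of segment and cluster similarity.

    `alpha` is a hypothetical mixing weight, not a parameter from the paper.
    """
    seg_vecs, cluster_vecs = build_index(segments, clusters)
    q = embed(query)
    cluster_of = {i: c for c, members in enumerate(clusters) for i in members}
    scored = []
    for i, vec in enumerate(seg_vecs):
        seg_score = cosine(q, vec)
        clu_score = cosine(q, cluster_vecs[cluster_of[i]])
        scored.append((alpha * seg_score + (1 - alpha) * clu_score, i))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [segments[i] for _, i in scored[:k]]


segments = [
    "the cat sat on the mat",
    "the cat chased a mouse",
    "stock prices fell sharply today",
    "investors sold shares as prices fell",
]
clusters = [[0, 1], [2, 3]]  # hand-made grouping, assumed for the example

top = retrieve("cat mouse", segments, clusters, k=2)
print(top)
```

In this toy setup the cluster score lets a segment that only partly matches the query benefit from closely related neighbours, which is the intuition behind adding a coarser level to the index.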
