Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Legal RAG systems frequently suffer from document-level retrieval mismatch (DRM), a critical failure mode in which retrieved information comes from entirely incorrect source documents. The cause is the high structural similarity among legal texts, and the effect is a severe loss of reliability. This paper provides the first formal definition and quantitative characterization of DRM. It proposes Summary-Augmented Chunking (SAC), a chunking method that requires no expert knowledge and explicitly injects global semantic context by concatenating document-level synthetic summaries with the original text chunks. SAC is compatible with standard chunking strategies and off-the-shelf summarization models. By improving the retriever's ability to tell documents apart, SAC reduces DRM incidence by up to 42% across multiple legal retrieval benchmarks, while also improving text-level precision and recall. These results demonstrate SAC's effectiveness, scalability, and practical utility in the legal domain.

📝 Abstract

Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
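
The chunk-enhancement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Chunk`, `split_fixed`, and `summary_augmented_chunks` are hypothetical, the character-window splitter stands in for any standard chunker, the `summarize` callable stands in for an off-the-shelf summarization model, and placing the summary before the chunk text is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_id: str
    text: str          # original chunk text, unchanged
    indexed_text: str  # summary + chunk: the string a retriever would embed


def split_fixed(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Plain character-window chunking (a stand-in for any standard chunker)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def summary_augmented_chunks(
    doc_id: str,
    text: str,
    summarize: Callable[[str], str],  # hypothetical hook for a summarization model
    size: int = 500,
    overlap: int = 50,
) -> List[Chunk]:
    """Attach one document-level summary to every chunk of the document."""
    summary = summarize(text)  # computed once per document, reused for all chunks
    return [
        # Assumption: the summary is placed in front of the chunk text.
        Chunk(doc_id, piece, f"{summary}\n\n{piece}")
        for piece in split_fixed(text, size, overlap)
    ]


if __name__ == "__main__":
    contract = "This Non-Disclosure Agreement is made between the parties. " * 20
    # Placeholder summarizer: a real system would call a summarization model here.
    chunks = summary_augmented_chunks("nda-001", contract, lambda t: t[:60].strip())
    print(len(chunks), repr(chunks[0].indexed_text[:80]))
```

Keeping `text` and `indexed_text` separate lets a pipeline embed the augmented string while still returning the original chunk to the generator; whether the paper does this is not stated in the abstract.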

Problem

Research questions and friction points this paper is trying to address.

Addressing retrieval failures in legal RAG systems

Mitigating document-level retrieval mismatch in legal databases

Enhancing retrieval accuracy with the Summary-Augmented Chunking technique
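
One way to put a number on document-level retrieval mismatch is sketched below. The paper's formal definition is not reproduced on this page, so the formula is an assumption: the share of retrieved chunks whose source document differs from the query's gold document. The function name `drm_rate` and the input layout are hypothetical.

```python
from typing import Dict, List


def drm_rate(retrieved: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of retrieved chunks that come from a non-gold document.

    retrieved: query id -> source document ids of its top-k retrieved chunks
    gold:      query id -> id of the document that contains the answer
    """
    total = 0
    mismatched = 0
    for query_id, doc_ids in retrieved.items():
        target = gold[query_id]
        total += len(doc_ids)
        mismatched += sum(1 for doc_id in doc_ids if doc_id != target)
    # Assumption: an empty retrieval set counts as zero mismatch.
    return mismatched / total if total else 0.0


if __name__ == "__main__":
    retrieved = {"q1": ["nda-1", "nda-1", "nda-7"], "q2": ["lease-3", "lease-9", "lease-9"]}
    gold = {"q1": "nda-1", "q2": "lease-9"}
    print(f"DRM rate: {drm_rate(retrieved, gold):.2f}")
```

Computing the same quantity with and without summary augmentation on one query set would show whether the retriever picks the right document more often.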

Innovation

Methods, ideas, or system contributions that make the work stand out.

Summary-Augmented Chunking (SAC) augments every text chunk with a summary of its source document

Document-level synthetic summary provides global context

Generic summarization strategy outperforms legal expert approach

🔎 Similar Papers

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval