Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

📅 2026-01-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models are prone to hallucinations when answering public health policy questions, limiting their reliability in high-stakes settings. This study systematically evaluates basic and advanced Retrieval-Augmented Generation (RAG) architectures for question answering over CDC policy documents, comparing recursive character chunking with semantic chunking strategies and incorporating a cross-encoder re-ranking mechanism. Using Mistral-7B-Instruct-v0.2 as the generator and all-MiniLM-L6-v2 for embeddings, experiments demonstrate that Advanced RAG substantially improves answer faithfulness to 0.797, significantly outperforming both the vanilla LLM (0.347) and Basic RAG (0.621). These results underscore the critical role of two-stage retrieval in achieving high-fidelity responses to precise policy-related queries.
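The recursive character chunking strategy evaluated in the study can be illustrated with a minimal plain-Python sketch. This is a simplified stand-in for production splitters (e.g., LangChain's `RecursiveCharacterTextSplitter`); the separator hierarchy and `chunk_size` below are illustrative assumptions, not the paper's exact configuration:

```python
def recursive_chunk(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text on coarse-to-fine separators (paragraphs,
    lines, sentences, words) until every chunk fits within chunk_size
    characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    # Merge small pieces back together to keep chunks dense.
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > chunk_size:
                        # Piece still too large: recurse with finer separators.
                        chunks.extend(recursive_chunk(part, chunk_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator present: fall back to a hard character split.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Semantic chunking, by contrast, would place boundaries where adjacent sentences' embedding similarity drops, rather than at fixed character counts.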

๐Ÿ“ Abstract
The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
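The two-stage retrieval mechanism the abstract credits for the faithfulness gains can be sketched as follows. This is a toy, self-contained illustration: bag-of-words cosine stands in for the bi-encoder (all-MiniLM-L6-v2) first stage, and a joint query-passage term-coverage score stands in for the cross-encoder re-ranker; the real system would use learned models for both stages:

```python
from collections import Counter
from math import sqrt

def bow_cosine(a, b):
    """Stage-1 score: bag-of-words cosine similarity (a cheap stand-in
    for bi-encoder embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def cross_score(query, passage):
    """Stage-2 score: scores the query-passage pair jointly (a stand-in
    for a cross-encoder); here, the fraction of query terms covered."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def two_stage_retrieve(query, corpus, k_retrieve=3, k_final=1):
    # Stage 1: fast retrieval over the whole corpus to build a shortlist.
    candidates = sorted(corpus, key=lambda d: bow_cosine(query, d), reverse=True)[:k_retrieve]
    # Stage 2: expensive re-ranking applied only to the shortlist.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k_final]
```

The design point is that the costly pairwise scorer never sees the full corpus, only the top-k candidates from the cheap first stage, which is what makes cross-encoder precision affordable at retrieval time.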
Problem

Research questions and friction points this paper is trying to address.

hallucination
Retrieval-Augmented Generation
policy document question answering
faithfulness
information integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
cross-encoder re-ranking
chunking strategies
faithfulness
policy document QA
Anuj Maharjan
Electrical Engineering and Computer Science (EECS), University of Toledo, Toledo, OH, USA
Umesh Yadav
The University of Toledo
Machine Learning · LLM · NLP · Health Care