🤖 AI Summary
To address hallucination in retrieval-augmented generation (RAG), which undermines model trustworthiness, this paper proposes a lightweight, efficient, and interpretable hallucination detection method. The authors design a compact 4B-parameter reasoning model that directly classifies document-claim pairs and generates evidence-based natural language explanations. To mitigate annotation scarcity, they construct a domain-agnostic synthetic dataset derived from FineWeb and distill large-model reasoning capabilities via preference fine-tuning with Odds Ratio Preference Optimization (ORPO). The method achieves 84.0% balanced accuracy on the RAGTruth subset, matching the performance of specialized 7B–8B models, and 75.7% balanced accuracy across the full benchmark, comparable to GPT-4o. The core contribution is a hallucination detection paradigm that jointly achieves high efficiency, strong performance, and inherent interpretability, empirically validating the feasibility of small models for ensuring RAG reliability.
📝 Abstract
Large Language Models (LLMs) excel at many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization (ORPO) to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models such as MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%), while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and its datasets under Apache 2.0 upon acceptance.
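The ORPO objective mentioned in (iii) combines a standard negative log-likelihood term on the preferred (chosen) response with an odds-ratio penalty that pushes the model's odds for the chosen response above those of the rejected one. A minimal sketch of the per-pair loss is below; it assumes length-normalized sequence log-probabilities as inputs, and `lam` is an illustrative weighting hyperparameter, not a value from the paper.

```python
import math

def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """Per-pair ORPO loss (sketch): NLL on the chosen response plus a
    log-sigmoid penalty on the log odds ratio between chosen and rejected.

    Inputs are length-normalized sequence log-probabilities, so exp(logp) < 1.
    """
    def log_odds(logp: float) -> float:
        # odds(y) = p / (1 - p), computed in log space
        p = math.exp(logp)
        return logp - math.log(1.0 - p)

    # Log odds ratio between chosen and rejected responses
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log sigmoid(ratio): small when chosen is much more likely than rejected
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    # Standard supervised term on the chosen (grounded) response
    nll = -logp_chosen
    return nll + lam * l_or
```

When the chosen response is already more probable than the rejected one, the odds-ratio term is near zero and the loss reduces to the supervised NLL; when preferences are violated, the penalty grows, which is how the objective distills the large model's preference signal without a separate reference model.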