🤖 AI Summary
Small language models (SLMs) suffer from limited factual knowledge, severe hallucination, and difficulty integrating retrieval-augmented generation (RAG).
Method: We propose the first LLM-to-SLM RAG capability distillation framework, featuring a dual-path mechanism—evidence chain distillation and knowledge graph alignment—augmented by multi-stage response consistency constraints and a privacy-aware RAG architecture to ensure high-fidelity factual knowledge transfer.
Contribution/Results: This work is the first to systematically distill RAG capabilities from large language models (LLMs) to SLMs; simultaneously mitigates hallucination and user privacy risks; and introduces a dedicated RAG evaluation benchmark for SLMs. Experiments demonstrate up to 27.7% higher factual accuracy over MiniRAG across multiple benchmarks, while significantly reducing model size and computational overhead—achieving both efficient inference and trustworthy generation.
📝 Abstract
Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content from Humans. In this work, we introduce $ exttt{DRAG}$, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model's predictions with a structured knowledge graph and ranked evidence, $ exttt{DRAG}$ effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With $ exttt{DRAG}$, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.