LettuceDetect: A Hallucination Detection Framework for RAG Applications

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses hallucinated answers in Retrieval-Augmented Generation (RAG) systems caused by imperfect integration of external knowledge. We propose a lightweight, fine-grained token-level hallucination detection framework. Methodologically, it adopts ModernBERT, an efficient encoder supporting 8K-token context windows, to build a low-overhead token-classification architecture, overcoming the context-length and computational-cost bottlenecks of conventional encoder models. The framework jointly encodes context-question-answer triples and is fine-tuned on the RAGTruth benchmark. Experiments show a 79.22% example-level F1 score, a 14.8% improvement over the previous state-of-the-art Luna model, while being roughly 30 times smaller than the best-performing models and sustaining a throughput of 30-60 examples per second on a single GPU. These advances improve both the reliability and deployment efficiency of RAG systems in production environments.

📝 Abstract
Retrieval-Augmented Generation (RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect, a framework that addresses two critical limitations in existing hallucination detection methods: (1) the context window constraints of traditional encoder-based methods, and (2) the computational inefficiency of LLM-based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing for the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, which is a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it more practical for real-world RAG applications.
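The core mechanic the abstract describes, classifying each answer token as supported or hallucinated and surfacing contiguous unsupported spans, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the `merge_token_spans` helper, the toy labels, and the offsets are all hypothetical, standing in for the output of a fine-tuned ModernBERT token classifier.

```python
# Sketch of the token-classification idea behind LettuceDetect: an encoder
# labels each answer token as supported (0) or hallucinated (1), and
# contiguous positive tokens are merged into character-level spans.
# The labels and offsets below are illustrative stand-ins for real
# model predictions and tokenizer offset mappings.

def merge_token_spans(labels, offsets):
    """Merge per-token binary labels into (start, end) character spans.

    labels  -- 0/1 per answer token (1 = unsupported/hallucinated)
    offsets -- (char_start, char_end) per answer token
    """
    spans = []
    current = None
    for label, (start, end) in zip(labels, offsets):
        if label == 1:
            if current is None:
                current = [start, end]
            else:
                current[1] = end  # extend the open span
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


# Toy example: an answer whose second sentence is unsupported by the context.
answer = "Paris is the capital. It has 90 million people."
offsets = [(0, 5), (6, 8), (9, 12), (13, 20), (22, 24),
           (25, 28), (29, 31), (32, 39), (40, 46)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
spans = merge_token_spans(labels, offsets)
print([answer[s:e] for s, e in spans])  # → ['It has 90 million people']
```

Span-level output like this is what makes token classification more actionable than a single example-level verdict: the flagged substring can be highlighted or stripped downstream.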
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinated answers in RAG systems.
Overcoming the context-window and computational-efficiency limitations of existing detectors.
Improving detection accuracy and processing speed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

ModernBERT extended context
Token-classification model
High computational efficiency
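The first two innovations, long-context encoding and token classification, hinge on packing the whole context-question-answer triple into one encoder input so every answer token can attend to the evidence. A minimal sketch of that packing step follows; the `[SEP]`-style separator template is an assumption for illustration, not the exact format LettuceDetect uses.

```python
# Hedged sketch: concatenating a context-question-answer triple into a single
# sequence for a long-context encoder such as ModernBERT (up to 8k tokens).
# The separator scheme is assumed for illustration only.

SEP = " [SEP] "

def build_input(context_passages, question, answer):
    """Join retrieved passages, the question, and the answer into one string
    so a token classifier can judge each answer token against the evidence."""
    context = SEP.join(context_passages)
    return context + SEP + question + SEP + answer

text = build_input(
    ["Paris is the capital of France."],
    "What is the capital of France?",
    "Paris is the capital of France. It has 90 million people.",
)
print(text)
```

Because the answer sits in the same sequence as the retrieved passages, no separate cross-encoding or chunking pass is needed, which is one source of the throughput advantage the paper reports.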
Ádám Kovács
KR Labs
Gábor Recski
TU Wien
NLP, semantics