CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of existing single-pass hallucination detectors (those based on internal signals such as model uncertainty or hidden-state geometry) to white-box, model-side adversarial attacks. To expose this weakness, we propose CORVUS, the first white-box red-teaming framework tailored for such detectors. CORVUS leverages lightweight LoRA fine-tuning combined with teacher forcing to manipulate detector-visible internal signals, and introduces embedding-space FGSM attention perturbations for stress testing. Requiring fewer than 0.5% of parameters to be trainable, CORVUS transfers effectively across diverse models, including Llama-2, Vicuna, Llama-3, and Qwen2.5, and significantly degrades the performance of state-of-the-art detectors such as LLM-Check, SEP, and ICR-probe. Our findings reveal critical fragility in current internal-signal-based approaches and advocate a new paradigm for hallucination auditing that integrates external evidence or cross-model verification.
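The core idea of signal camouflage can be illustrated with a minimal sketch: a low-rank (LoRA-style) residual on a frozen weight is trained to suppress a detector-visible signal such as predictive entropy, so an uncertainty-based detector sees a "confident" model even though the backbone is untouched. This is illustrative only; the class names and the entropy objective below are assumptions, not the paper's exact loss or architecture.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual (LoRA-style).
    Hypothetical sketch; not the CORVUS implementation."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # backbone stays frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.A @ self.B       # low-rank update

def entropy_camouflage_loss(logits):
    # Lower predictive entropy makes the model "look" confident to an
    # uncertainty-based detector (an assumed stand-in objective).
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-9)).sum(-1).mean()

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(16, 100))
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-2)
x = torch.randn(32, 16)                                 # toy hidden states

start = entropy_camouflage_loss(layer(x)).item()
for _ in range(200):                                    # train only A and B
    opt.zero_grad()
    entropy_camouflage_loss(layer(x)).backward()
    opt.step()
end = entropy_camouflage_loss(layer(x)).item()
print(end < start)  # detector-visible entropy shrinks; base weights unchanged
```

The frozen backbone mirrors the paper's setting: the adversary never touches the original model weights, only a small adapter, which is what keeps the attack lightweight and transferable.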

📝 Abstract
Single-pass hallucination detectors rely on internal telemetry (e.g., uncertainty, hidden-state geometry, and attention) of large language models, implicitly assuming hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, including an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions (<0.5% trainable parameters), CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
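The embedding-space FGSM stress test mentioned in the abstract can be sketched as a single fast-gradient-sign step on the input embeddings rather than on discrete tokens. The toy model and loss below are assumptions for self-containment; the paper's actual procedure perturbs a real LLM's embeddings.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for an LLM: embedding table plus a linear head.
    Illustrative only; not the models used in the paper."""
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, inputs_embeds):
        return self.head(inputs_embeds)

def fgsm_perturb_embeddings(model, input_ids, targets, epsilon=0.01):
    """One FGSM step in embedding space: move each token embedding by
    epsilon along the sign of the loss gradient, a standard way to
    stress-test internal signals without changing the token sequence."""
    embeds = model.embed(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    return embeds + epsilon * embeds.grad.sign()

torch.manual_seed(0)
model = TinyLM()
ids = torch.randint(0, 100, (1, 8))
tgt = torch.randint(0, 100, (1, 8))
adv = fgsm_perturb_embeddings(model, ids, tgt)
print(adv.shape)  # same shape as the clean embeddings: torch.Size([1, 8, 16])
```

Because the perturbation lives in the continuous embedding space, each coordinate moves by at most epsilon, giving a controlled probe of how fragile the detector's internal signals are.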
Problem

Research questions and friction points this paper is trying to address.

hallucination detection
adversarial red-teaming
internal signal camouflage
large language models
detector robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

red-teaming
hallucination detection
internal signal camouflage
LoRA adaptation
adversarial auditing
Nay Myat Min
Singapore Management University
Long H. Pham
Singapore Management University
Hongyu Zhang
Chongqing University
Software Engineering · Mining Software Repositories · Data-driven Software Engineering · Software Analytics
Jun Sun
Singapore Management University