🤖 AI Summary
This work addresses the vulnerability of existing single-pass hallucination detectors, which rely on internal signals such as model uncertainty or hidden-state geometry, to white-box, model-side adversarial attacks. To expose this weakness, we propose CORVUS, the first white-box red-teaming framework tailored for such detectors. CORVUS leverages lightweight LoRA fine-tuning combined with teacher forcing to manipulate detector-visible internal signals, and introduces embedding-space FGSM attention perturbations for stress testing. Requiring fewer than 0.5% trainable parameters, CORVUS transfers effectively across diverse models, including Llama-2, Vicuna, Llama-3, and Qwen2.5, and significantly degrades the performance of state-of-the-art detectors such as LLM-Check, SEP, and ICR-probe. Our findings reveal critical fragility in current internal-signal-based approaches and advocate a new paradigm for hallucination auditing that integrates external evidence or cross-model verification.
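The low-rank adaptation mentioned above can be illustrated with a toy sketch: a frozen weight matrix W is augmented by a trainable low-rank update B·A, so only A and B are updated during fine-tuning (in real models, rank r ≪ hidden dimension makes these a tiny parameter fraction). All names and shapes here are illustrative, not CORVUS's actual implementation.

```python
# Toy LoRA-adapted linear layer: y = W x + alpha * B (A x).
# W is frozen; only the low-rank factors A and B would be trained.

def matvec(M, v):
    # Plain matrix-vector product over nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    base = matvec(W, x)                   # frozen base path
    low_rank = matvec(B, matvec(A, x))    # trainable low-rank path
    return [b + alpha * l for b, l in zip(base, low_rank)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (identity here)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B = [[0.5], [0.0]]             # rank-1 up-projection (2x1)
x = [2.0, 3.0]
y = lora_forward(W, A, B, x)   # [2 + 0.5*5, 3 + 0] = [4.5, 3.0]
```

In a camouflage objective like the one described, the gradient of a detector-visible loss would flow only into A and B, leaving the base model weights untouched.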
📝 Abstract
Single-pass hallucination detectors rely on the internal telemetry of large language models (e.g., uncertainty, hidden-state geometry, and attention), implicitly assuming that hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, complemented by an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions with fewer than 0.5% of parameters trainable, CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
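The FGSM stress test named in the abstract can be sketched numerically: FGSM takes a single step of size ε in the sign of the loss gradient, here applied to input embeddings rather than pixels. The stand-in loss L(e) = Σ e² (with analytic gradient 2e) and all names below are illustrative assumptions, not the paper's actual objective or API.

```python
# Minimal embedding-space FGSM step: perturb each embedding coordinate by
# epsilon in the direction of the gradient sign, maximizing the linearized
# attack objective.

def sign(x):
    return (x > 0) - (x < 0)

def fgsm_step(embeds, grad, epsilon=0.01):
    return [e + epsilon * sign(g) for e, g in zip(embeds, grad)]

embeds = [0.5, -1.2, 0.0, 3.4]       # stand-in flattened embedding vector
grad = [2.0 * e for e in embeds]     # analytic gradient of L(e) = sum(e^2)
adv = fgsm_step(embeds, grad, epsilon=0.01)
```

Because only the gradient sign is used, every coordinate moves by at most ε, which is why FGSM-style perturbations are a natural bounded stress test for attention-based detector signals.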