🤖 AI Summary
This work addresses the vulnerability of existing single-pass hallucination detectors, which rely on internal signals such as model uncertainty or hidden-state geometry, to white-box, model-side adversarial attacks. To expose this weakness, we propose CORVUS, the first white-box red-teaming framework tailored for such detectors. CORVUS leverages lightweight LoRA fine-tuning combined with teacher forcing to manipulate detector-visible internal signals, and introduces embedding-space FGSM attention perturbations for stress testing. Requiring fewer than 0.5% trainable parameters, CORVUS transfers effectively across diverse models, including Llama-2, Vicuna, Llama-3, and Qwen2.5, and significantly degrades the performance of state-of-the-art detectors such as LLM-Check, SEP, and ICR-probe. Our findings reveal critical fragility in current internal-signal-based approaches and advocate a new paradigm for hallucination auditing that integrates external evidence or cross-model verification.
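The low-rank adaptation mentioned above can be illustrated with a toy sketch: a frozen weight matrix W is augmented by a trainable low-rank update B·A, so only A and B are updated during fine-tuning (in real models, rank r ≪ hidden dimension makes these a tiny parameter fraction). All names and shapes here are illustrative, not CORVUS's actual implementation.

```python
# Toy LoRA-adapted linear layer: y = W x + alpha * B (A x).
# W is frozen; only the low-rank factors A and B would be trained.

def matvec(M, v):
    # Plain matrix-vector product over nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    base = matvec(W, x)                   # frozen base path
    low_rank = matvec(B, matvec(A, x))    # trainable low-rank path
    return [b + alpha * l for b, l in zip(base, low_rank)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (identity here)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B = [[0.5], [0.0]]             # rank-1 up-projection (2x1)
x = [2.0, 3.0]
y = lora_forward(W, A, B, x)   # [2 + 0.5*5, 3 + 0] = [4.5, 3.0]
```

In a camouflage objective like the one described, the gradient of a detector-visible loss would flow only into A and B, leaving the base model weights untouched.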
📝 Abstract
Single-pass hallucination detectors rely on the internal telemetry of large language models (e.g., uncertainty, hidden-state geometry, and attention), implicitly assuming that hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, complemented by an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions with fewer than 0.5% of parameters trainable, CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
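The FGSM stress test named in the abstract can be sketched numerically: FGSM takes a single step of size ε in the sign of the loss gradient, here applied to input embeddings rather than pixels. The stand-in loss L(e) = Σ e² (with analytic gradient 2e) and all names below are illustrative assumptions, not the paper's actual objective or API.

```python
# Minimal embedding-space FGSM step: perturb each embedding coordinate by
# epsilon in the direction of the gradient sign, maximizing the linearized
# attack objective.

def sign(x):
    return (x > 0) - (x < 0)

def fgsm_step(embeds, grad, epsilon=0.01):
    return [e + epsilon * sign(g) for e, g in zip(embeds, grad)]

embeds = [0.5, -1.2, 0.0, 3.4]       # stand-in flattened embedding vector
grad = [2.0 * e for e in embeds]     # analytic gradient of L(e) = sum(e^2)
adv = fgsm_step(embeds, grad, epsilon=0.01)
```

Because only the gradient sign is used, every coordinate moves by at most ε, which is why FGSM-style perturbations are a natural bounded stress test for attention-based detector signals.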