🤖 AI Summary
Vision-language models (VLMs) remain prone to hallucination in open-ended visual question answering (VQA), which calls for reproducible detection methods. This paper proposes HEDGE, a unified multimodal hallucination detection framework that frames detection as a geometric robustness problem. HEDGE combines controlled visual perturbations, semantic clustering (entailment- and embedding-based), and uncertainty metrics such as VASE into a single reproducible pipeline. Its key contribution is showing how model architecture, prompt structure, sampling scale, and clustering strategy jointly shape detection performance, which gives a compute-aware foundation for evaluating multimodal reliability. Experiments on VQA-RAD and KvasirVQA-x1 show that hallucination detectability is highest for unified-fusion models with dense visual tokenization (e.g., Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Across configurations, VASE provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ≈ 10–15).
📝 Abstract
Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures.
Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas entailment-based clustering, which relies on natural language inference (NLI), remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ≈ 10–15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses.
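To make the clustering and metric steps concrete, the following is a minimal sketch of how sampled answers might be grouped by embedding similarity and scored with an entropy over the resulting clusters. It is an illustration under stated assumptions, not the hedge-bench API and not the definition of VASE: the function names, the greedy clustering rule, the 0.9 cosine threshold, and the toy two-dimensional "embeddings" are all hypothetical stand-ins.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_embedding(embeddings, threshold=0.9):
    """Greedy clustering (hypothetical rule, chosen for brevity).

    Each vector joins the first cluster whose representative (its first
    member) has cosine similarity >= threshold; otherwise it starts a new
    cluster. Returns one integer cluster label per input vector.
    """
    representatives = []
    labels = []
    for vec in embeddings:
        for idx, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


def cluster_entropy(labels):
    """Shannon entropy (in nats) of the empirical cluster distribution.

    A generic semantic-uncertainty score, used here as a stand-in; it is
    NOT the VASE metric, which the abstract does not define.
    """
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy answer embeddings (made up): one set agrees, the other disagrees.
stable_answers = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
unstable_answers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

stable_score = cluster_entropy(cluster_by_embedding(stable_answers))
unstable_score = cluster_entropy(cluster_by_embedding(unstable_answers))
```

In this toy setting the answers that agree collapse into a single cluster and score zero entropy, while the answers that disagree spread across clusters and score higher. A real pipeline would replace the toy vectors with embeddings of answers sampled from a VLM on clean and perturbed images.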
By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE.