On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination—generating non-existent objects in image descriptions—primarily due to high epistemic uncertainty in certain visual tokens within the vision encoder. This work establishes, for the first time, a direct link between visual token uncertainty and object hallucination. We propose Uncertainty-Aware Token Masking (UATM), a novel mitigation method that detects uncertain tokens exhibiting large representational deviations in early vision encoder layers via adversarial perturbations, and dynamically masks them during self-attention computation in intermediate layers. UATM requires modifications only to the vision encoder—leaving the language model and training data unchanged. Experiments across multiple mainstream LVLMs demonstrate that UATM significantly reduces object hallucination rates while improving description accuracy and reliability. Moreover, UATM is orthogonal to existing hallucination-mitigation techniques and can be seamlessly integrated with them.

📝 Abstract
Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, crucial challenges remain in LVLMs, such as object hallucination: generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy that uses adversarial perturbations to identify uncertain visual tokens efficiently, and a masking scheme that suppresses these uncertain visual tokens during the self-attention process in the middle layers of the VE, reducing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and works synergistically with existing mitigation methods.
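The two-stage procedure the abstract describes—probe each visual token's representation deviation under a small input perturbation, then mask the most uncertain tokens during self-attention—can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the linear layer standing in for the early VE block, the tensor shapes, the top-k selection, and the random perturbation (in place of the paper's adversarial perturbation) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical shapes): 8 visual tokens with 4-dim features.
N, D = 8, 4
W = rng.normal(size=(D, D))        # stand-in for an early VE layer
tokens = rng.normal(size=(N, D))   # patch embeddings entering the VE

def early_layer(x):
    # Stand-in for the early encoder block whose outputs are probed.
    return x @ W.T

# Stage 1: estimate per-token epistemic uncertainty as the representation
# deviation under a small perturbation (random noise here; the paper uses
# adversarial perturbations chosen to maximize this deviation).
eps = 1e-2
delta = eps * rng.normal(size=tokens.shape)
deviation = np.linalg.norm(
    early_layer(tokens + delta) - early_layer(tokens), axis=1
)

# Stage 2: mask the top-k most uncertain tokens during self-attention,
# as would happen in the middle layers of the VE.
k = 2
uncertain = np.argsort(deviation)[-k:]   # indices of top-k uncertain tokens

def masked_self_attention(x, masked_idx):
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores[:, masked_idx] = -np.inf      # no query attends to masked tokens
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ x, probs

out, probs = masked_self_attention(tokens, uncertain)
# Every query assigns zero attention weight to the masked tokens,
# so their features no longer propagate through visual encoding.
```

Because only the attention columns of the masked tokens are zeroed, the remaining tokens still attend to each other normally—matching the claim that only the vision encoder is modified while the language model stays untouched.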
Problem

Research questions and friction points this paper is trying to address.

Analyzing uncertain visual tokens causing object hallucinations in LVLMs
Identifying high epistemic uncertainty via adversarial perturbations in vision encoders
Mitigating hallucinations by masking uncertain tokens during self-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses adversarial perturbations to identify uncertain visual tokens
Masks uncertain tokens in middle vision encoder layers
Modifies only vision encoder to reduce object hallucinations
Hoigi Seo
Dept. of ECE, Seoul National University, Republic of Korea
Dong Un Kang
PhD student, Seoul National University
Deep learning
Hyunjin Cho
Dept. of ECE, Seoul National University, Republic of Korea
Joohoon Lee
IPAI & INMC, Seoul National University, Republic of Korea
Se Young Chun
Department of Electrical and Computer Engineering, Seoul National University
computational imaging, machine learning, signal processing, multimodal processing