🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination, i.e., generating descriptions of objects that are not present in the input image, and high epistemic uncertainty in certain visual tokens within the vision encoder is a key contributing factor. This work establishes, for the first time, a direct link between visual-token uncertainty and object hallucination. The authors propose Uncertainty-Aware Token Masking (UATM), a mitigation method that identifies uncertain tokens, namely those whose early vision-encoder representations deviate strongly under small adversarial perturbations, and dynamically masks them during self-attention in the encoder's middle layers. UATM modifies only the vision encoder; the language model and training data remain unchanged. Experiments across multiple mainstream LVLMs demonstrate that UATM significantly reduces object-hallucination rates while improving description accuracy and reliability. Moreover, UATM is orthogonal to existing hallucination-mitigation techniques and can be seamlessly combined with them.
📝 Abstract
Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination, i.e., generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis reveals positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations carry high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy that mitigates object hallucination by modifying the VE only. Our method comprises an efficient proxy, based on adversarial perturbations, for identifying uncertain visual tokens, and a mechanism that masks these tokens during self-attention in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with prior methods.
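The two-stage idea in the abstract (score each visual token by how much its early-layer representation shifts under a small perturbation, then zero out high-uncertainty tokens in self-attention) can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the scalar `toy_layer`, the use of random rather than adversarial perturbations, and all function names are assumptions for the sake of a self-contained example.

```python
import math
import random

def toy_layer(x, w):
    """A stand-in for an early VE layer: elementwise tanh(w * x)."""
    return [math.tanh(w * v) for v in x]

def deviation_under_perturbation(token, w, eps=1e-2, trials=8, seed=0):
    """Proxy uncertainty score: mean L2 shift of the token's early-layer
    output under small perturbations of its input. (The paper uses
    adversarial perturbations; random noise is a simplification here.)"""
    rng = random.Random(seed)
    clean = toy_layer(token, w)
    total = 0.0
    for _ in range(trials):
        noisy = [v + eps * rng.uniform(-1.0, 1.0) for v in token]
        out = toy_layer(noisy, w)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(out, clean)))
    return total / trials

def masked_attention_weights(scores, masked):
    """Softmax over one query's attention logits with masked key tokens
    suppressed: their logits are set to -inf, so they get zero weight."""
    logits = [(-math.inf if j in masked else s) for j, s in enumerate(scores)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

In this toy, a token sitting on the steep part of the tanh (input near zero) shifts far more under perturbation than a saturated one, so `deviation_under_perturbation` flags it as uncertain; feeding the flagged indices into `masked_attention_weights` then removes its influence on the remaining tokens, mirroring the masking step in the middle VE layers.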