SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

📅 2025-06-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) remain plagued by object hallucination, which undermines visual understanding accuracy. To address this, we propose an object-centric selective and contrastive decoding framework: it employs attention gating to progressively select multi-scale visual features and introduces a contrastive decoding loss, an object-aware fusion module, and a theoretically grounded scale-consistency constraint. Crucially, our approach is the first to formalize human perceptual alignment as a process of cross-scale priority ranking and discrepancy suppression. Evaluated on mainstream hallucination benchmarks, including POPE and HallusionBench, our method achieves an average improvement of 12.7%, significantly outperforming strong baselines such as LLaVA and Qwen-VL. Theoretical analysis establishes convergence guarantees and demonstrates superior generalization properties.
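To make the core idea concrete, the summary's "discrepancy suppression" step can be sketched as a standard contrastive decoding rule between two visual scales. The snippet below is a hypothetical illustration, not SECOND's exact formulation: `logits_fine`, `logits_coarse`, `alpha`, and `beta` are assumed names, and the plausibility mask follows the common adaptive-constraint heuristic from the contrastive decoding literature.

```python
import math

def contrastive_decode(logits_fine, logits_coarse, alpha=1.0, beta=0.1):
    """Illustrative cross-scale contrastive decoding (sketch only).

    Boosts tokens supported by the fine-scale (detail-rich) view and
    penalizes tokens driven mainly by the coarse view, where object
    hallucinations tend to originate.
    """
    # Softmax over the fine-scale logits to build a plausibility mask:
    # keep only tokens the fine view assigns at least beta * max prob.
    m = max(logits_fine)
    exps = [math.exp(l - m) for l in logits_fine]
    z = sum(exps)
    probs_fine = [e / z for e in exps]
    cutoff = beta * max(probs_fine)

    best_idx, best_score = -1, -math.inf
    for i, (lf, lc) in enumerate(zip(logits_fine, logits_coarse)):
        if probs_fine[i] < cutoff:
            continue  # implausible even under the fine view: skip
        score = (1 + alpha) * lf - alpha * lc  # cross-scale contrast
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

For example, if the coarse view inflates token 1 (a hallucinated object) while the fine view only weakly prefers it, the contrast flips the decision back to token 0 even though a plain argmax over the fine logits would pick token 1.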

📝 Abstract
Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information across scales, SECOND significantly reduces perceptual hallucinations and outperforms existing methods on a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
Problem

Research questions and friction points this paper is trying to address.

Addressing object hallucination in Vision-Language Models
Leveraging multi-scale visual information effectively
Reducing perceptual hallucinations via selective contrastive decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective multi-scale visual integration
Object-centric contrastive decoding
Iterative hallucination reduction technique
Woohyeon Park
Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea
Woojin Kim
Stanford University, Economics
Jaeik Kim
Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, South Korea
Jaeyoung Do
Department of Electrical and Computer Engineering, Seoul National University
Generative AI (LLMs) · Multi-Modal AI (NLP/Vision) · Big Data Systems