SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision-language models (VLMs) remain plagued by object hallucination, undermining visual understanding accuracy. To address this, we propose a selective and contrastive decoding framework centered on objects: it employs attention gating to progressively select multi-scale visual features and introduces a contrastive decoding loss, an object-aware fusion module, and a theoretically grounded scale-consistency constraint. Crucially, our approach is the first to formalize human perceptual alignment as a cross-scale priority ranking and discrepancy suppression process. Evaluated on mainstream hallucination benchmarks—including POPE and HallusionBench—our method achieves an average improvement of 12.7%, significantly outperforming strong baselines such as LLaVA and Qwen-VL. Theoretical analysis establishes convergence guarantees and demonstrates superior generalization properties.

Technology Category

Application Category

📝 Abstract

Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information with an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By contrasting these visual information iteratively, SECOND significantly reduces perceptual hallucinations and outperforms a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.

Problem

Research questions and friction points this paper is trying to address.

Addressing object hallucination in Vision-Language Models

Leveraging multi-scale visual information effectively

Reducing perceptual hallucinations via selective contrastive decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective multi-scale visual integration

Object-centric contrastive decoding

Iterative hallucination reduction technique

🔎 Similar Papers

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

2024-08-04arXiv.orgCitations: 8

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

2024-05-24Neural Information Processing SystemsCitations: 9

Authors to Follow