Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

📅 2025-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual-language models (VLMs) frequently suffer from visual hallucinations due to insufficient vision–language alignment during autoregressive decoding. While existing post-hoc, fine-tuning-free mitigation strategies—such as contrastive suppression or static visual weighting—reduce hallucinations, they often degrade language reasoning capability. To address this trade-off, we propose an adaptive perceptual amplification mechanism that requires no model modification or fine-tuning. Our method dynamically identifies salient visual tokens at each decoding step via attention heatmaps, then applies multi-scale region resampling and structure-preserving local amplification to iteratively refine visual grounding. The approach is plug-and-play compatible with mainstream VLMs including LLaVA and Qwen-VL. Evaluated on POPE and HalluBench, it reduces hallucination rates by 32.7% on average; simultaneously, it improves performance on comprehension benchmarks (MME, MMBench) by 2.1%. To our knowledge, this is the first fine-tuning-free method to break the long-standing fidelity–reasoning trade-off in VLMs.

Technology Category

Application Category

📝 Abstract
Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by reducing biases contrastively or amplifying the weights of visual embedding during decoding. However, these approaches improve visual perception at the cost of impairing the language reasoning capability. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. Specifically, by magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning capabilities.Code is available at https://github.com/ShunqiM/PM .
Problem

Research questions and friction points this paper is trying to address.

Mitigates visual hallucination in vision-language models
Enhances visual perception without impairing language reasoning
Improves accuracy by focusing on fine-grained visual details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iteratively isolates relevant visual tokens
Magnifies critical regions preserving context
Enhances visual scrutiny without impairing reasoning
🔎 Similar Papers
No similar papers found.