🤖 AI Summary
This work addresses visual hallucinations in multimodal large language models (MLLMs), which often arise from over-reliance on language priors during generation. Existing decoding strategies that indiscriminately suppress these priors risk disrupting the model's semantic manifold, a degradation the authors term “Manifold Departure”. To address this, they propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that accounts for the dual nature of language priors, which can help or harm depending on their alignment with visual evidence. MGAP constructs a language-prior subspace via singular value decomposition (SVD), projects each multimodal hidden state onto that subspace during decoding, and applies a consistency-aware gate that attenuates only the projected prior component while largely preserving the orthogonal semantic components. Experiments on the POPE and CHAIR benchmarks show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing textual coherence.
📝 Abstract
Multimodal large language models (MLLMs) frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to over-reliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizing language priors. However, these methods overlook the dual nature of language priors, which can be either helpful or harmful depending on their alignment with visual evidence. In particular, blindly suppressing language priors often disrupts the model's semantic manifold, leading to performance degradation, a phenomenon we term Manifold Departure. To address this, we propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that mitigates hallucinations while preserving representation structure. MGAP first constructs a language-prior subspace from blind hidden states via SVD. During decoding, MGAP projects each multimodal hidden state onto this subspace and applies a consistency-aware gate to adaptively attenuate only the projected prior component, yielding a subspace-selective update that largely preserves the orthogonal semantic components. Extensive experiments on POPE and CHAIR show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing coherence.
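The subspace-selective update described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the subspace is spanned by the top right-singular vectors of a matrix of stacked blind hidden states, that the gate is a scalar in [0, 1] supplied by the caller (the abstract does not specify how the consistency-aware gate is computed), and that the update subtracts the gated projection. The function names `build_prior_subspace` and `gated_projection`, the `rank` parameter, and the random data are all hypothetical.

```python
import numpy as np


def build_prior_subspace(blind_states, rank):
    """Return an orthonormal basis of shape (d, rank) from stacked hidden states.

    Assumption: the language-prior subspace is spanned by the top `rank`
    right-singular vectors of the (n, d) matrix of blind hidden states.
    """
    _, _, vt = np.linalg.svd(blind_states, full_matrices=False)
    return vt[:rank].T


def gated_projection(h, basis, gate):
    """Attenuate only the component of h that lies in span(basis).

    `gate` is a hypothetical scalar in [0, 1]: 0 leaves h unchanged,
    1 removes the in-subspace component entirely.
    """
    prior_part = basis @ (basis.T @ h)  # projection of h onto the subspace
    return h - gate * prior_part        # orthogonal component is untouched


# Synthetic stand-ins for real hidden states (illustration only).
rng = np.random.default_rng(0)
blind_states = rng.normal(size=(64, 16))  # 64 blind hidden states, dimension 16
basis = build_prior_subspace(blind_states, rank=4)

h = rng.normal(size=16)                   # one multimodal hidden state
h_new = gated_projection(h, basis, gate=0.5)
```

With `gate=0.5`, the coordinates of `h_new` inside the subspace are half those of `h`, while the component orthogonal to the subspace is identical to that of `h`. That is the sense in which such an update is subspace-selective: everything outside the prior subspace passes through unchanged.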