🤖 AI Summary
Multimodal large language models (MLLMs) excel at complex understanding tasks but suffer from hallucinations (outputs misaligned with visual or factual evidence) that degrade reliability. Existing mitigation strategies either compromise general reasoning capabilities (training-based methods such as DPO) or rely on manually engineered perturbations to model hallucinations (training-free contrastive decoding), limiting generalizability. To address this, the authors propose Decoupling Contrastive Decoding (DCD), a framework that: (1) decouples the learning of positive and negative samples in preference data, training separate positive and negative image projections within the MLLM; (2) uses the negative projection to implicitly capture authentic hallucination patterns, yielding vision-aware negative images for contrastive decoding; and (3) requires no handcrafted image degradation at inference. Experiments across hallucination benchmarks and general reasoning tasks show that DCD matches DPO's hallucination suppression while preserving general reasoning capability, and that it outperforms handcrafted contrastive decoding methods, combining reliability and versatility in MLLM inference.
📝 Abstract
Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, such as direct preference optimization (DPO), leverage paired preference data to suppress hallucinations; however, they risk sacrificing general reasoning capabilities due to likelihood displacement. Meanwhile, training-free solutions, such as contrastive decoding, suppress hallucinations by subtracting the hallucination pattern estimated from a distorted input. Yet these handcrafted perturbations (e.g., adding noise to images) may poorly capture authentic hallucination patterns. To avoid the weaknesses of both approaches and achieve robust hallucination mitigation (i.e., mitigation that maintains general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images at the contrastive decoding inference stage. DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive experiments and ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD: it matches DPO's hallucination suppression while preserving general capabilities, and it outperforms handcrafted contrastive decoding methods.
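The inference-time step described above can be sketched with the generic contrastive decoding combination rule: amplify the logits from the positive (faithful) branch and subtract the logits from the negative (hallucination-modeling) branch. This is a minimal illustration under assumed conventions; the weight `alpha`, the function name, and the toy logit values are hypothetical and not the paper's exact formulation.

```python
def contrastive_decode(logits_pos, logits_neg, alpha=0.5):
    """Generic contrastive decoding combination (illustrative sketch):
    boost the positive-branch logits and subtract the negative-branch
    logits, which stand in for the estimated hallucination pattern."""
    return [(1 + alpha) * p - alpha * n for p, n in zip(logits_pos, logits_neg)]

# Toy next-token logits from the two branches (hypothetical values).
pos = [2.0, 1.0, 0.0]  # branch using the positive image projection
neg = [0.5, 1.0, 0.0]  # branch using the negative (hallucination) projection
combined = contrastive_decode(pos, neg)
# combined == [2.75, 1.0, 0.0]: token 0, where the branches disagree,
# is reinforced; token 1, where they agree, is left unchanged.
```

Note that when the two branches agree on a token (here token 1), the rule returns the original logit, so only tokens that the negative branch favors disproportionately are penalized.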