What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the critical issue of hallucination in large vision-language models (LVLMs), whose relationship with architectural design remains poorly understood. The work systematically decomposes LVLMs into three core components—language foundation, visual representation, and semantic alignment—and introduces a fine-grained hallucination taxonomy encompassing co-occurrence, similarity, and uncertainty errors. To enable rigorous evaluation, the authors propose CoSimUE, a benchmark featuring controllable perturbations targeting each hallucination type. Through comprehensive ablation studies, they find that merely scaling model size yields limited gains in hallucination mitigation; instead, stronger language models suppress co-occurrence hallucinations, high-resolution visual encoders alleviate similarity-based errors, and effective alignment strategies reduce uncertainty-driven hallucinations. Most notably, jointly optimizing visual fidelity and alignment quality delivers the most robust and holistic improvement across all hallucination categories.

📝 Abstract

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations; and 5) cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.
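
To make the three-way taxonomy concrete, the sketch below tallies a hallucination rate per type from a list of probe outcomes. It is only an illustration under stated assumptions: the abstract does not specify CoSimUE's data format or scoring metric, so `ProbeResult`, `hallucination_rates`, and the simple error-fraction metric are hypothetical names and choices, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three hallucination types named in the abstract.
HALLUCINATION_TYPES = ("co-occurrence", "similarity", "uncertainty")


@dataclass(frozen=True)
class ProbeResult:
    """One benchmark probe (hypothetical record layout, not CoSimUE's)."""
    hallucination_type: str  # which type the perturbed probe targets
    hallucinated: bool       # True if the model's answer was a hallucination


def hallucination_rates(results):
    """Return the fraction of hallucinated answers for each probed type."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        if r.hallucination_type not in HALLUCINATION_TYPES:
            raise ValueError(f"unknown hallucination type: {r.hallucination_type!r}")
        totals[r.hallucination_type] += 1
        errors[r.hallucination_type] += int(r.hallucinated)
    # Report only the types that were actually probed, in a fixed order.
    return {t: errors[t] / totals[t] for t in HALLUCINATION_TYPES if totals[t]}
```

Computing these rates separately for each architectural variant (for example, different language foundations or image resolutions) would give the kind of design-choice-to-hallucination-type comparison the abstract describes.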

Problem

Research questions and friction points this paper is trying to address.

hallucination

Large Vision-Language Models

architecture design

reliability

visual-language alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination robustness

architecture design

vision-language models