Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of hallucination in vision-language models (VLMs), which often arises from unstable internal assumptions during reasoning. Existing approaches rely solely on final outputs and thus struggle to detect such hallucinations; this study instead shows that hallucinations stem from the repeated revision and eventual entrenchment of erroneous hypotheses across decoder layers, a phenomenon the authors term "overthinking." To quantify this, they introduce the Overthinking Score, derived from cross-layer hypothesis tracking, attention patterns, and entropy analysis, which measures both the number of competing hypotheses the model entertains and their instability across layers. This approach breaks from the conventional output-only detection paradigm and achieves state-of-the-art performance, attaining F1 scores of 78.9% on MSCOCO and 71.58% on AMBER.

📝 Abstract
Vision-language models (VLMs) often hallucinate non-existent objects. Detecting hallucination is analogous to detecting deception: a single final statement is insufficient; one must examine the underlying reasoning process. Yet existing detectors rely mostly on final-layer signals. Attention-based methods assume hallucinated tokens exhibit low attention, while entropy-based ones use final-step uncertainty. Our analysis reveals the opposite: hallucinated objects can exhibit peaked attention due to contextual priors, and models often express high confidence because intermediate layers have already converged to an incorrect hypothesis. We show that the key to hallucination detection lies within the model's thought process, not its final output. By probing decoder layers, we uncover a previously overlooked behavior, overthinking: models repeatedly revise object hypotheses across layers before committing to an incorrect answer. Once the model latches onto a confounded hypothesis, that hypothesis can propagate through subsequent layers, ultimately causing hallucination. To capture this behavior, we introduce the Overthinking Score, a metric that measures how many competing hypotheses the model entertains and how unstable these hypotheses are across layers. This score significantly improves hallucination detection: 78.9% F1 on MSCOCO and 71.58% on AMBER.
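The abstract describes tracking object hypotheses across decoder layers (e.g., via logit-lens-style probing) and scoring how many competing hypotheses appear and how often they switch. The paper's exact formulation is not given here, so the following is only a minimal illustrative sketch: it assumes we already have a per-layer probability distribution over candidate object tokens, and combines three hedged proxies (distinct top-1 hypotheses, top-1 switches between adjacent layers, and mean entropy) with equal weights.

```python
import math

def overthinking_score(layer_probs):
    """Toy overthinking-style score (NOT the authors' exact metric).

    layer_probs: one dict per decoder layer, mapping candidate object
    tokens to probabilities (e.g., each layer's hidden state projected
    through the output head, logit-lens style).
    """
    top1 = [max(p, key=p.get) for p in layer_probs]
    n_hypotheses = len(set(top1))                              # competing hypotheses
    n_switches = sum(a != b for a, b in zip(top1, top1[1:]))   # cross-layer instability
    mean_entropy = sum(
        -sum(v * math.log(v) for v in p.values() if v > 0)
        for p in layer_probs
    ) / len(layer_probs)
    # Unweighted sum for illustration; real weights would need tuning.
    return n_hypotheses + n_switches + mean_entropy

# Example: the model flip-flops between "dog" and "cat" across layers
# before committing, which should yield a higher score than a stable run.
unstable = [
    {"dog": 0.6, "cat": 0.4},
    {"dog": 0.3, "cat": 0.7},
    {"dog": 0.55, "cat": 0.45},
    {"cat": 0.9, "dog": 0.1},
]
stable = [{"dog": 0.9, "cat": 0.1}] * 4
```

Under this sketch, `overthinking_score(unstable)` exceeds `overthinking_score(stable)`, matching the paper's claim that unstable cross-layer hypotheses signal hallucination.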
Problem

Research questions and friction points this paper is trying to address.

hallucination
vision language models
overthinking
confounder propagation
object hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

overthinking
hallucination detection
vision language models
confounder propagation
decoder probing