🤖 AI Summary
This work addresses the hallucination problem in multimodal reasoning models caused by "reasoning drift": a phenomenon where models over-attend to visual entities irrelevant to the query, leading to outputs detached from image evidence. To mitigate this, the authors propose ClueTracer, a training-free, parameter-free, and architecture-agnostic plug-in that traces task-relevant clues backward along the reasoning path from question to outputs to visual tokens, thereby localizing pertinent regions and suppressing spurious attention. The study is the first to formally characterize the reasoning drift mechanism, introduces ClueRecall as a dedicated evaluation metric, and demonstrates consistent performance gains: an average 1.21× improvement across multiple reasoning benchmarks and a 1.14× boost when transferred to non-reasoning tasks, significantly enhancing visual grounding and reducing hallucinations.
📄 Abstract
Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode these clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify *reasoning drift*: during clue gathering, the model over-focuses on question-irrelevant entities, diluting attention to task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, **without any additional training**, ClueTracer improves all **reasoning** architectures (including `R1-OneVision`, `Ocean-R1`, `MM-Eureka`, etc.) by **1.21×** on reasoning benchmarks. When transferred to **non-reasoning** settings, it yields a **1.14×** gain.
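The tracing idea described above can be illustrated with a minimal sketch. This is **not** the paper's implementation: the function name `trace_clues`, the two toy attention matrices, and the top-k keep ratio are all hypothetical, standing in for whatever attention maps and selection rule ClueTracer actually uses. The sketch only shows the compositional step: relevance flows from question tokens to output tokens, then from output tokens to visual tokens, and low-relevance visual tokens are masked out.

```python
import numpy as np

def trace_clues(attn_q_to_out, attn_out_to_vis, keep_ratio=0.34):
    """Hypothetical sketch of backward clue tracing.

    attn_q_to_out:   (n_question, n_output) attention weights
    attn_out_to_vis: (n_output,  n_visual)  attention weights
    Returns per-visual-token relevance scores and a boolean keep-mask.
    """
    # Compose attention along question -> outputs -> visual tokens:
    # how strongly each visual token supports the question, via the outputs.
    relevance = attn_q_to_out.sum(axis=0) @ attn_out_to_vis  # (n_visual,)

    # Keep only the top-k visual tokens; the rest are treated as
    # spurious attention targets and suppressed.
    k = max(1, int(keep_ratio * relevance.size))
    keep = np.argsort(relevance)[-k:]
    mask = np.zeros(relevance.size, dtype=bool)
    mask[keep] = True
    return relevance, mask

# Toy example: 2 question tokens, 2 output tokens, 3 visual tokens.
attn_q_to_out = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
attn_out_to_vis = np.array([[0.9, 0.1, 0.0],
                            [0.0, 0.2, 0.8]])
relevance, mask = trace_clues(attn_q_to_out, attn_out_to_vis)
```

With the toy matrices above, visual token 0 receives the highest composed relevance and is the only token kept at this ratio; in a real model, the kept tokens would correspond to image patches grounding the answer.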