ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning

๐Ÿ“… 2026-02-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses the hallucination problem in multimodal reasoning models caused by โ€œreasoning driftโ€โ€”a phenomenon where models over-attend to visual entities irrelevant to the query, leading to outputs detached from image evidence. To mitigate this, the authors propose ClueTracer, a training-free, parameter-free, and architecture-agnostic plug-in that traces task-relevant clues backward along the reasoning path from question to output to visual tokens, thereby localizing pertinent regions and suppressing spurious attention. The study is the first to formally characterize the reasoning drift mechanism, introduces ClueRecall as a dedicated evaluation metric, and demonstrates consistent performance gains: an average 1.21ร— improvement across multiple reasoning benchmarks and a 1.14ร— boost when transferred to non-reasoning tasks, significantly enhancing visual grounding and reducing hallucinations.

๐Ÿ“ Abstract
Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode them into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify \emph{reasoning drift}: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question $\rightarrow$ outputs $\rightarrow$ visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, \textbf{without any additional training}, ClueTracer improves all \textbf{reasoning} architectures (including \texttt{R1-OneVision}, \texttt{Ocean-R1}, \texttt{MM-Eureka}, \emph{etc}.) by $\mathbf{1.21\times}$ on reasoning benchmarks. When transferred to \textbf{non-reasoning} settings, it yields a $\mathbf{1.14\times}$ gain.
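The backward tracing described in the abstract (question → outputs → visual tokens) can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the function name `trace_clues`, the averaging over question tokens, and the top-k patch selection are all hypothetical choices standing in for whatever relevance propagation and suppression rule ClueTracer actually uses.

```python
import numpy as np

def trace_clues(attn_q_to_out, attn_out_to_vis, keep_ratio=0.3):
    """Toy sketch of question -> output -> visual clue tracing.

    attn_q_to_out:   (n_question, n_output) attention weights from
                     question tokens to generated reasoning tokens.
    attn_out_to_vis: (n_output, n_visual) attention weights from
                     reasoning tokens to visual patch tokens.
    Returns a boolean mask over visual patches marking those deemed
    task-relevant; the rest would have their attention suppressed.
    """
    # Propagate question relevance through the reasoning tokens to the
    # visual patches by chaining the two attention maps.
    relevance = attn_q_to_out.mean(axis=0) @ attn_out_to_vis  # (n_visual,)
    # Keep the top-k most relevant patches (hypothetical suppression rule).
    k = max(1, int(keep_ratio * relevance.size))
    thresh = np.partition(relevance, -k)[-k]
    return relevance >= thresh
```

In a real model the two attention maps would come from the decoder's cross- and self-attention tensors, and the mask would gate attention to visual tokens during subsequent decoding steps.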
Problem

Research questions and friction points this paper is trying to address.

hallucination
multimodal reasoning
reasoning drift
visual grounding
clue tracing
Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination suppression
clue tracing
training-free
multimodal reasoning
visual grounding
Gongli Xi
Beijing University of Posts and Telecommunications, Beijing, China
Kun Wang
Singapore University of Technology and Design
Deep Learning · Computer Vision
Zeming Gao
Beijing University of Posts and Telecommunications, Beijing, China
Huahui Yi
West China Biomedical Big Data Center, West China Hospital, SCU
Haolang Lu
Beijing University of Posts and Telecommunications, Beijing, China
Ye Tian
Beijing University of Posts and Telecommunications, Beijing, China
Wendong Wang
China University of Petroleum (East China)
Flow in Porous Media · CCUS · Unconventional Resource Development