🤖 AI Summary
This work addresses the hallucination problem in multimodal reasoning models caused by "reasoning drift": a phenomenon where models over-attend to visual entities irrelevant to the query, leading to outputs detached from image evidence. To mitigate this, the authors propose ClueTracer, a training-free, parameter-free, and architecture-agnostic plug-in that traces task-relevant clues backward along the reasoning path from question to outputs to visual tokens, thereby localizing pertinent regions and suppressing spurious attention. The study is the first to formally characterize the reasoning drift mechanism, introduces ClueRecall as a dedicated evaluation metric, and demonstrates consistent performance gains: an average 1.21× improvement across multiple reasoning benchmarks and a 1.14× boost when transferred to non-reasoning tasks, significantly enhancing visual grounding and reducing hallucinations.
📄 Abstract
Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode these clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify *reasoning drift*: during clue gathering, the model over-focuses on question-irrelevant entities, diluting attention to task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, **without any additional training**, ClueTracer improves all **reasoning** architectures (including `R1-OneVision`, `Ocean-R1`, `MM-Eureka`, etc.) by **1.21×** on reasoning benchmarks. When transferred to **non-reasoning** settings, it yields a **1.14×** gain.
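The tracing idea described above can be illustrated with a minimal sketch. This is **not** the paper's implementation: the function name `trace_clues`, the two toy attention matrices, and the top-k keep ratio are all hypothetical, standing in for whatever attention maps and selection rule ClueTracer actually uses. The sketch only shows the compositional step: relevance flows from question tokens to output tokens, then from output tokens to visual tokens, and low-relevance visual tokens are masked out.

```python
import numpy as np

def trace_clues(attn_q_to_out, attn_out_to_vis, keep_ratio=0.34):
    """Hypothetical sketch of backward clue tracing.

    attn_q_to_out:   (n_question, n_output) attention weights
    attn_out_to_vis: (n_output,  n_visual)  attention weights
    Returns per-visual-token relevance scores and a boolean keep-mask.
    """
    # Compose attention along question -> outputs -> visual tokens:
    # how strongly each visual token supports the question, via the outputs.
    relevance = attn_q_to_out.sum(axis=0) @ attn_out_to_vis  # (n_visual,)

    # Keep only the top-k visual tokens; the rest are treated as
    # spurious attention targets and suppressed.
    k = max(1, int(keep_ratio * relevance.size))
    keep = np.argsort(relevance)[-k:]
    mask = np.zeros(relevance.size, dtype=bool)
    mask[keep] = True
    return relevance, mask

# Toy example: 2 question tokens, 2 output tokens, 3 visual tokens.
attn_q_to_out = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
attn_out_to_vis = np.array([[0.9, 0.1, 0.0],
                            [0.0, 0.2, 0.8]])
relevance, mask = trace_clues(attn_q_to_out, attn_out_to_vis)
```

With the toy matrices above, visual token 0 receives the highest composed relevance and is the only token kept at this ratio; in a real model, the kept tokens would correspond to image patches grounding the answer.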