🤖 AI Summary
This work addresses the performance degradation that multimodal large language models suffer when they are required to emit explicit visual grounding signals, such as object bounding boxes, during inference, which can interfere with reasoning. To overcome this limitation, the authors propose iVGR, a framework that internalizes fine-grained visual grounding capabilities into purely textual chain-of-thought reasoning, enabling accurate perception without explicit grounding at test time. iVGR employs a dual-stream training strategy coupled with a consistency-based reward, using reinforcement learning to align a textual reasoning stream with a high-quality visually grounded stream during training. Experimental results show that iVGR significantly outperforms existing baselines across multiple fine-grained multimodal benchmarks, while preserving inference flexibility and compatibility with tool-assisted inference workflows.
📝 Abstract
While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.
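The abstract does not say how the consistency reward is computed. As a rough illustration only, the Python sketch below shows one way a dual-stream reward could be shaped: the textual stream earns a base reward for a correct answer and a bonus when it agrees with a visually grounded stream that is itself correct. The function names, the agreement-on-answers criterion, and the `weight` parameter are all assumptions made for this example and are not taken from the paper.

```python
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace so trivially different answers compare equal."""
    return answer.strip().lower()


def consistency_reward(
    text_answer: str,
    grounded_answer: str,
    gold_answer: str,
    weight: float = 0.5,  # hypothetical bonus weight, not a value from the paper
) -> float:
    """Hypothetical reward for the textual stream in a dual-stream setup.

    Assumption: consistency is measured on final answers only. The actual
    iVGR reward may compare localization or reasoning content instead.
    """
    text_correct = normalize(text_answer) == normalize(gold_answer)
    grounded_correct = normalize(grounded_answer) == normalize(gold_answer)
    streams_agree = normalize(text_answer) == normalize(grounded_answer)

    # Base term: the textual stream answered correctly.
    reward = 1.0 if text_correct else 0.0
    # Consistency bonus: only when the grounded stream is a reliable target.
    if grounded_correct and streams_agree:
        reward += weight
    return reward
```

Gating the bonus on the grounded stream being correct reflects the abstract's phrase "high-quality visually grounded stream": under this assumption, the textual stream is never rewarded for matching a grounded rollout that is wrong.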