Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

📅 2026-05-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work investigates “tool-use collapse” in visual reasoning agents: during training, models progressively stop invoking tools even as task accuracy rises, particularly on complex tasks such as 3D spatial reasoning and medical visual question answering. The authors propose treating tools as training-time scaffolding and add an entropy regularization term that encourages diverse exploration over both language generation and tool invocation, rather than maximizing invocation frequency. Experiments show that entropy-regularized training achieves the best performance even as tool usage gradually declines, while removing tools entirely degrades performance and explicitly incentivizing tool use yields only marginal gains. These results indicate that higher tool-use frequency does not by itself produce stronger reasoning, challenging frequency-driven tool optimization and offering a new perspective on tool integration in multimodal reasoning systems.
📝 Abstract
Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. We further observe similar dynamics on medical VQA, suggesting that tool-use collapse is not limited to 3D spatial reasoning. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io
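The abstract does not give the exact form of the entropy regularization term, so the following is only a minimal sketch of the general idea, assuming a REINFORCE-style objective in which the mean per-step policy entropy is subtracted from the loss with a weight `beta`. The function names, the `beta` value, and the treatment of tool calls as ordinary entries in the action vocabulary are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities for one decision step."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def entropy(log_probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def regularized_loss(
    step_logits: Sequence[Sequence[float]],
    actions: Sequence[int],
    advantage: float,
    beta: float = 0.01,  # hypothetical weight; not taken from the paper
) -> Tuple[float, float]:
    """Policy-gradient loss minus an entropy bonus, averaged over steps.

    Each step's action space is assumed to mix text tokens and tool
    invocations, so the bonus rewards spread over both kinds of choice.
    """
    pg_terms, ent_terms = [], []
    for logits, action in zip(step_logits, actions):
        log_probs = log_softmax(logits)
        pg_terms.append(-advantage * log_probs[action])
        ent_terms.append(entropy(log_probs))
    n = len(pg_terms)
    mean_entropy = sum(ent_terms) / n
    loss = sum(pg_terms) / n - beta * mean_entropy
    return loss, mean_entropy
```

Under this sketch, a larger `beta` lowers the loss for rollouts whose per-step distributions stay spread out, which is one common way to discourage a policy from narrowing onto a single reasoning path, whether or not that path uses tools.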
Problem

Research questions and friction points this paper is trying to address.

tool-use collapse
visual chain-of-thought
visual reasoning
3D spatial reasoning
medical VQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-use collapse
visual chain-of-thought
entropy regularization
rollout diversity
scaffolded exploration
🔎 Similar Papers