Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the common assumption that tool use inherently enhances multimodal reasoning by distinguishing tool availability from actual tool contribution. It evaluates two multimodal agents, Thyme and DeepEyesV2, on real-world understanding, OCR, chart understanding, and mathematical reasoning, comparing each tool-using agent with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations show that the full tool-use loop does not consistently outperform either the tool-call format alone or the returned execution result alone. These findings point to a gap between tool calling and genuine capability gains in the settings studied.

📝 Abstract

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.
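
The overlap statistic in the abstract (the share of tool-solved problems that at least one non-tool setting also solves) can be illustrated with plain set arithmetic. The sketch below is only an illustration under stated assumptions: the function name `tool_overlap`, the problem IDs, and the toy solved sets are hypothetical and are not taken from the paper's code or data.

```python
def tool_overlap(tool_solved, *non_tool_solved):
    """Return (overlap fraction, tool-only set) for one agent.

    tool_solved: set of problem IDs solved with tool access.
    non_tool_solved: one set per non-tool setting
        (e.g. Tool-Free, Pure-Text Reasoner).
    """
    # Problems solved by at least one non-tool setting.
    covered = set().union(*non_tool_solved)
    # Problems that only the tool-using setting solves.
    tool_only = tool_solved - covered
    if not tool_solved:
        return 0.0, tool_only
    overlap = len(tool_solved & covered) / len(tool_solved)
    return overlap, tool_only


# Hypothetical toy data, not results from the paper.
with_tools = {"q1", "q2", "q3", "q4"}
tool_free = {"q1", "q2", "q5"}
pure_text = {"q3", "q6"}

fraction, only_with_tools = tool_overlap(with_tools, tool_free, pure_text)
print(f"{fraction:.0%} of tool-solved problems are also solved without tools")
print("tool-only solved set:", sorted(only_with_tools))
```

A high overlap fraction together with a small tool-only set is the pattern the abstract reports; the tool-only set is the part of the result that would indicate capability actually contributed by tools.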

Problem

Research questions and friction points this paper is trying to address.

multimodal agents

tool use

capability evaluation

tool contribution

benchmark gains

Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented agents

multimodal reasoning

systematic evaluation