AI Summary
This work addresses the high cost and redundancy of perception tool invocation in vision-language agents, which often lack an effective mechanism for judging a call before it runs. The authors propose ToolGate, a lightweight external controller that makes a binary execute/skip decision after a ReAct-style agent proposes a tool call but before the result is injected into the context, using both trajectory text and simple structural features. Presented as the first systematic study of pre-call control for vision-language agents, the work evaluates ToolGate on two Qwen3-VL backbones across five benchmarks: it reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings, and with matched-domain trajectory training on Qwen3-VL-30B it improves average accuracy by 1.65 points.
Abstract
Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.
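To make the pre-call control point concrete, the following is a minimal sketch of where an execute/skip gate could sit in a ReAct-style loop. It is not the authors' implementation: the names `ProposedCall`, `structural_features`, `gate`, `react_step`, and `toy_score` are hypothetical, the three features are illustrative stand-ins for the "simple structural features" the abstract mentions, the scorer is a toy rule rather than a trained controller, and the placeholder observation written on a skip is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedCall:
    tool: str       # e.g. "ocr", "detect", "segment"
    argument: str   # raw argument string proposed by the agent
    step: int       # index of this step in the trajectory


def structural_features(trajectory: List[str], call: ProposedCall) -> List[float]:
    """Illustrative features only; the actual feature set is not specified here."""
    prior_calls = sum(1 for line in trajectory if line.startswith("Action:"))
    same_tool = sum(
        1 for line in trajectory if line.startswith(f"Action: {call.tool}(")
    )
    return [float(call.step), float(prior_calls), float(same_tool)]


def gate(
    trajectory: List[str],
    call: ProposedCall,
    score_fn: Callable[[str, List[float]], float],
    threshold: float = 0.5,
) -> bool:
    """Return True to execute the proposed call, False to skip it."""
    text = "\n".join(trajectory)
    feats = structural_features(trajectory, call)
    return score_fn(text, feats) >= threshold


def react_step(
    trajectory: List[str],
    call: ProposedCall,
    run_tool: Callable[[ProposedCall], str],
    score_fn: Callable[[str, List[float]], float],
) -> List[str]:
    """Apply the gate after the call is proposed, before any output enters the context."""
    execute = gate(trajectory, call, score_fn)
    new_trajectory = trajectory + [f"Action: {call.tool}({call.argument})"]
    if execute:
        # Only executed calls pay the tool cost and add tool output to the context.
        new_trajectory.append(f"Observation: {run_tool(call)}")
    else:
        # Assumption: a short placeholder tells the agent the call was not run.
        new_trajectory.append("Observation: [skipped by controller]")
    return new_trajectory


def toy_score(text: str, feats: List[float]) -> float:
    """Toy stand-in for a learned controller: skip repeated calls to the same tool."""
    same_tool = feats[2]
    return 1.0 if same_tool == 0 else 0.0


if __name__ == "__main__":
    def fake_ocr(call: ProposedCall) -> str:
        return "STOP"

    traj = ["Question: what does the sign say?"]
    traj = react_step(traj, ProposedCall("ocr", "image_0", 0), fake_ocr, toy_score)
    traj = react_step(traj, ProposedCall("ocr", "image_0", 1), fake_ocr, toy_score)
    print("\n".join(traj))
```

In this sketch the gate sees only the trajectory so far and the proposed call, never the tool output, which is what distinguishes pre-call control from filtering results after execution. A trained controller would replace `toy_score` with a model fitted on labelled trajectories.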