ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

📅 2026-06-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the cost and redundancy of perception tool calls made by vision-language agents, which often lack an effective way to judge a call before it runs. The authors propose ToolGate, a lightweight external controller that makes a binary execute/skip decision after a ReAct-style agent proposes a tool call but before the result is injected into the context, using both trajectory text and simple structural features. Presented as the first systematic study of pre-call control for vision-language agents, ToolGate is evaluated on two Qwen3-VL backbones across five benchmarks. It reduces token consumption to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings, and with matched-domain trajectory training on Qwen3-VL-30B it improves average accuracy by 1.65 percentage points.
📝 Abstract
Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.
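The pre-call control point described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `ProposedCall`, `gate_features`, `toy_gate`, `run_step`, the specific features, and the placeholder text written to the context on a skip are all hypothetical, and the hand-written rule stands in for the learned controller, whose actual features and model are not given in this listing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProposedCall:
    """A tool call proposed by the agent (hypothetical structure)."""
    tool: str       # e.g. "ocr", "detect", "segment"
    argument: str   # argument passed to the tool
    step: int       # index of this step in the trajectory


def gate_features(trajectory: List[str], call: ProposedCall) -> Dict[str, int]:
    """Illustrative structural features; not the paper's feature set."""
    prefix = f"Action: {call.tool}("
    return {
        "step": call.step,
        "prior_same_tool": sum(1 for line in trajectory if line.startswith(prefix)),
        "context_chars": sum(len(line) for line in trajectory),
    }


def toy_gate(trajectory: List[str], call: ProposedCall) -> bool:
    """Return True to execute, False to skip.

    Placeholder rule standing in for a learned execute/skip predictor:
    execute only the first use of a tool, and only early in the trajectory.
    """
    f = gate_features(trajectory, call)
    return f["prior_same_tool"] == 0 and f["step"] < 4


def run_step(
    trajectory: List[str],
    call: ProposedCall,
    tools: Dict[str, Callable[[str], str]],
    gate: Callable[[List[str], ProposedCall], bool],
) -> bool:
    """Apply the gate after the call is proposed and before any output
    enters the context. Returns True if the tool was executed."""
    history = list(trajectory)  # context as it was before this proposal
    trajectory.append(f"Action: {call.tool}({call.argument})")
    if gate(history, call):
        observation = tools[call.tool](call.argument)
        trajectory.append(f"Observation: {observation}")
        return True
    # Assumption: a skipped call leaves a short marker instead of tool output.
    trajectory.append("Observation: [skipped by gate]")
    return False


if __name__ == "__main__":
    tools = {"ocr": lambda arg: f"text read from {arg}"}
    trajectory = ["Question: what does the sign say?"]
    for step in range(2):
        executed = run_step(trajectory, ProposedCall("ocr", "image_0", step), tools, toy_gate)
        print(step, "executed" if executed else "skipped")
    print("\n".join(trajectory))
```

In this sketch a skipped call never invokes the tool, so the saving comes from the tool output that is never generated or appended to the context. A trained classifier over the same inputs could replace `toy_gate` without changing `run_step`.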
Problem

Research questions and friction points this paper is trying to address.

tool-augmented agents
pre-call control
vision-language models
token efficiency
perceptual tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

pre-call control
tool-augmented agents
token efficiency
vision-language models
ToolGate
🔎 Similar Papers