TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses credit misassignment in tool-augmented multimodal search agents: algorithms such as GRPO broadcast a single trajectory-level advantage to every token, so valuable tool calls embedded in otherwise failed trajectories are penalized along with the rest. The paper formally characterizes and quantifies this phenomenon and introduces a plug-and-play credit correction method that requires no additional annotations, models, or sampling. The method relies on the parameter-determinism property of information-gathering tools, under which calls with similar parameters are treated as equivalent actions that deserve comparable credit. It constructs counterfactual witnesses within the current training batch and applies a confidence-gated, conservative advantage correction to compensate for misplaced negative credit. Evaluated across multiple multimodal search benchmarks, the method consistently improves three reinforcement learning algorithms (GRPO, GSPO, and SAPO) with negligible computational overhead.
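
To make the broadcasting issue concrete, the following minimal sketch computes group-normalized advantages in the style commonly attributed to GRPO and copies each trajectory's scalar onto all of its tokens. The function name `grpo_token_advantages`, the use of the population standard deviation, and the toy rewards are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-6):
    """Group-normalize trajectory rewards, then broadcast each scalar to every token.

    rewards:      one scalar reward per sampled trajectory in the group
    token_counts: number of generated tokens in each trajectory
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a trajectory receives the same advantage, including
    # tokens that belong to a tool call that retrieved useful evidence.
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Toy group: trajectories 0 and 3 succeed, 1 and 2 fail.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 4, 2, 3]
per_token = grpo_token_advantages(rewards, token_counts)
```

In this toy group, all four tokens of the failing trajectory at index 1 share one negative value, regardless of whether some of them encode a useful tool call.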

📝 Abstract

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.
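
The abstract does not give the exact correction rule, so the following is only a hedged sketch of what a confidence-gated, conservative correction could look like. Everything here is hypothetical: the Jaccard similarity over parameter tokens, the `gate` threshold, and the scaling rule are stand-ins chosen for illustration, not TAPO's actual formulation. The sketch treats tool calls from successful trajectories in the same batch as witnesses and only ever shrinks a negative advantage toward zero.

```python
def jaccard(a, b):
    """Hypothetical similarity between two tool-call parameter strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def corrected_advantage(adv, call, witness_calls, gate=0.6):
    """Soften the penalty on a tool call that closely matches a successful one.

    adv:           broadcast advantage of the trajectory containing `call`
    call:          parameter string of the tool call in question
    witness_calls: parameter strings of tool calls from successful
                   trajectories in the same batch (assumed given)
    gate:          hypothetical confidence threshold
    """
    if adv >= 0 or not witness_calls:
        return adv  # only negative credit is ever corrected
    confidence = max(jaccard(call, w) for w in witness_calls)
    if confidence < gate:
        return adv  # gate closed: leave the penalty unchanged
    # Conservative: move toward zero, never past it.
    return adv * (1.0 - confidence)


witnesses = ["search eiffel tower height meters"]
softened = corrected_advantage(-1.0, "search eiffel tower height", witnesses)
unchanged = corrected_advantage(-1.0, "lookup weather paris", witnesses)
```

Here the first call shares four of five parameter tokens with the witness, so its penalty is reduced, while the unrelated call keeps its full negative advantage. Because the rule reuses calls already present in the batch, it needs no extra sampling, which matches the overhead claim in the abstract.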

Problem

Research questions and friction points this paper is trying to address.

credit misassignment

tool-augmented agents

multimodal search

reinforcement learning

trajectory-level advantage

Innovation

Methods, ideas, or system contributions that make the work stand out.

credit assignment

tool-augmented agents

policy optimization