🤖 AI Summary
This work addresses an inefficiency in reinforcement-learning training of large language model (LLM) agents: frequent calls to external tools leave GPUs underutilized and drive up computational cost. Conventional caching strategies ignore the environment state that a tool call depends on, so reusing their results is often incorrect. To resolve this, the authors propose a stateful tool-value cache that records historical tool-call sequences in a tree and uses longest-prefix matching so that a result is reused only when the full tool-call history, and therefore the environment state, is identical. The approach is described as the first to enable state-aware cache reuse, achieving cache hit rates of up to 70% across terminal interaction, SQL generation, and video understanding tasks. It reduces median tool execution time by up to 6.9× without degrading reward accumulation during post-training.
📝 Abstract
In RL post-training of LLM agents, calls to external tools take several seconds or even minutes, leaving allocated GPUs idle and inflating post-training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect, since tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool-value cache for LLM agent post-training. TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups: a hit occurs only when the agent's full tool history matches a previously executed sequence, guaranteeing identical environment state. On three diverse workloads (terminal-based tasks, SQL generation, and video understanding), TVCACHE achieves cache hit rates of up to 70% and reduces median tool call execution time by up to 6.9×, with no degradation in post-training reward accumulation.
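The lookup rule in the abstract can be illustrated with a small prefix tree. The following is a minimal sketch, not TVCACHE's actual implementation: the names `ToolCallTrie`, `Node`, and `ToolCall`, and the choice to represent a tool call as a `(tool name, serialized arguments)` pair, are assumptions made for illustration. Each path from the root is one observed sequence of tool calls, and a lookup counts as a hit only if the whole history plus the new call lies on an existing path with a stored output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: (tool name, serialized arguments).
ToolCall = Tuple[str, str]


@dataclass
class Node:
    # Output of the tool call that ends at this node; None means "not cached".
    output: Optional[str] = None
    children: Dict[ToolCall, "Node"] = field(default_factory=dict)


class ToolCallTrie:
    """Prefix tree keyed by tool-call sequences (illustrative sketch)."""

    def __init__(self) -> None:
        self.root = Node()

    def longest_prefix(self, calls: List[ToolCall]) -> Tuple[Node, int]:
        """Walk the tree along `calls`; return the deepest node reached and its depth."""
        node, depth = self.root, 0
        for call in calls:
            child = node.children.get(call)
            if child is None:
                break
            node, depth = child, depth + 1
        return node, depth

    def lookup(self, history: List[ToolCall], call: ToolCall) -> Optional[str]:
        """Return a cached output only if the full history plus `call` was seen before."""
        sequence = [*history, call]
        node, depth = self.longest_prefix(sequence)
        if depth == len(sequence):
            return node.output  # still None if this path was created but never filled
        return None  # partial match only: environment state may differ

    def insert(self, history: List[ToolCall], call: ToolCall, output: str) -> None:
        """Record the output of `call` executed after `history`."""
        node = self.root
        for step in [*history, call]:
            node = node.children.setdefault(step, Node())
        node.output = output


# Usage: the same `ls` call yields different outputs under different histories.
cache = ToolCallTrie()
mkdir = ("bash", "mkdir demo")
ls = ("bash", "ls")
cache.insert([], ls, "")
cache.insert([mkdir], ls, "demo")
fresh = cache.lookup([], ls)
after_mkdir = cache.lookup([mkdir], ls)
unseen = cache.lookup([("bash", "touch a")], ls)
```

A flat cache keyed only by `("bash", "ls")` would return the same output in all three cases; keying on the whole sequence is what keeps the second lookup distinct from the first and turns the third into a miss.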