🤖 AI Summary
This work addresses efficient, stable learning in large-scale discrete action spaces where only a small, unknown subset of actions is effective for a given task. The authors propose the Sparse Agentic Control (SAC) framework, which applies ℓ₁,₂-regularized policy learning under a compressed-sensing paradigm to achieve sample-efficient optimization. They provide the first theoretical guarantees for sparse policy learning in this setting: whereas dense policies require Ω(M) samples, the sparse approach needs only O(k log M). Key technical contributions include a convex surrogate loss, a policy-restricted strong convexity (Policy-RSC) condition, a primal-dual witness construction, and a belief-error analysis under partial observability. The resulting policy attains a value-suboptimality bound of order k(log M/T)^{1/2}, exactly recovers the effective action set when T > k log M, and retains logarithmic dependence on M even in partially observable environments.
📝 Abstract
Tool-augmented LLM systems expose a control regime that learning theory has largely ignored: sequential decision-making with a massive discrete action universe (tools, APIs, documents) in which only a small, unknown subset is relevant for any fixed task distribution. We formalize this setting as Sparse Agentic Control (SAC), where policies admit block-sparse representations over M ≫ 1 actions and rewards depend on sparse main effects and (optionally) sparse synergies. We study ℓ₁,₂-regularized policy learning through a convex surrogate and establish sharp, compressed-sensing-style results: (i) estimation and value suboptimality scale as k(log M/T)^{1/2} under a Policy-RSC condition; (ii) exact tool-support recovery holds via primal-dual witness arguments when T > k log M under incoherence and β-min conditions; and (iii) any dense policy class requires Ω(M) samples, explaining the instability of prompt-only controllers. We further show that under partial observability, LLMs matter only through a belief/representation error ε_b, yielding an additive O(ε_b) degradation while preserving logarithmic dependence on M. Extensions cover tuning-free, online, robust, group-sparse, and interaction-aware SAC.
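To make the compressed-sensing mechanism behind these rates concrete, here is a minimal, self-contained NumPy sketch of ℓ₁,₂ (group-lasso) recovery; it is an illustration of the general technique, not the paper's actual algorithm, and all sizes, constants, and variable names (`M`, `d`, `k`, `T`, `lam`) are illustrative assumptions. With T on the order of k log M samples, block soft-thresholding via proximal gradient recovers the k effective blocks out of M, even though the dense problem is underdetermined (T < M·d):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper): M action blocks of
# dimension d, only k of them effective; sample budget T scales as k log M.
M, d, k = 200, 3, 5
T = int(10 * k * np.log(M))                 # T ~ k log M, far below M*d parameters

support = np.sort(rng.choice(M, size=k, replace=False))
theta_true = np.zeros((M, d))
theta_true[support] = rng.uniform(1.0, 3.0, (k, d)) * rng.choice([-1, 1], (k, d))

# Linear reward model with per-action feature blocks and small noise.
X = rng.normal(size=(T, M, d))
y = np.einsum('tmd,md->t', X, theta_true) + 0.1 * rng.normal(size=T)

# Proximal gradient (ISTA) on the group-lasso surrogate:
#   (1/2T) ||y - <X, theta>||^2  +  lam * sum_m ||theta_m||_2
lam = 2.0 * np.sqrt(np.log(M) / T)          # regularizer at the sqrt(log M / T) scale
Xf = X.reshape(T, M * d)
step = T / np.linalg.norm(Xf, 2) ** 2       # 1 / Lipschitz constant of the smooth part
theta = np.zeros((M, d))
for _ in range(500):
    grad = (Xf.T @ (Xf @ theta.ravel() - y) / T).reshape(M, d)
    z = theta - step * grad
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    # Block soft-threshold: zeroes out whole action blocks at once.
    theta = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12)) * z

recovered = sorted(np.flatnonzero(np.linalg.norm(theta, axis=1) > 0.3))
print(recovered)
```

The block-wise shrinkage is what distinguishes ℓ₁,₂ from plain ℓ₁: entire action blocks are switched off together, which is what makes exact support recovery of the effective action set possible at T ~ k log M rather than Ω(M).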