🤖 AI Summary
This paper studies reinforcement learning in the agnostic policy learning setting, where the prescribed policy class $\Pi$ need not contain the globally optimal policy, and the goal is to find a policy competitive with the best policy in $\Pi$. To enable the design and analysis of first-order optimization methods in this non-Euclidean policy space, the authors introduce the *variational gradient dominance* (VGD) condition, an assumption strictly weaker than classical completeness and coverability conditions. Building on it, they establish unified convergence guarantees and sample complexity upper bounds for three algorithms: Steepest Descent Policy Optimization, Conservative Policy Iteration (reinterpreted as a Frank-Wolfe method), and an on-policy Policy Mirror Descent. The analysis reveals fundamental connections between policy optimization and first-order optimization theory, and empirical evaluation shows that the VGD condition is practically verifiable across standard RL benchmarks. Key contributions include: (i) relaxing a critical completeness assumption; (ii) providing a unified analytical framework for policy optimization; (iii) strengthening convergence guarantees; and (iv) bridging policy optimization with first-order optimization methodology.
📝 Abstract
We study reinforcement learning (RL) in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest $\Pi$ -- crucially, without assuming that $\Pi$ contains the optimal policy. We propose a general policy learning framework that reduces this problem to first-order optimization in a non-Euclidean space, leading to new algorithms as well as shedding light on the convergence properties of existing ones. Specifically, under the assumption that $\Pi$ is convex and satisfies a variational gradient dominance (VGD) condition -- an assumption known to be strictly weaker than more standard completeness and coverability conditions -- we obtain sample complexity upper bounds for three policy learning algorithms: *(i)* Steepest Descent Policy Optimization, derived from a constrained steepest descent method for non-convex optimization; *(ii)* the classical Conservative Policy Iteration algorithm (Kakade & Langford, 2002), reinterpreted through the lens of the Frank-Wolfe method, which leads to improved convergence results; and *(iii)* an on-policy instantiation of the well-studied Policy Mirror Descent algorithm. Finally, we empirically evaluate the VGD condition across several standard environments, demonstrating the practical relevance of our key assumption.
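To make the Frank-Wolfe reading of Conservative Policy Iteration concrete, here is a minimal tabular sketch of one CPI update. This is an illustration, not the paper's algorithm: the policy representation (a state-by-action probability table), the function name, and the assumption that action values $Q^\pi$ are supplied by an oracle are all ours. In Frank-Wolfe terms, the greedy policy plays the role of the linear minimization oracle over the policy simplex, and the update mixes it conservatively into the current policy with step size $\alpha$.

```python
import numpy as np

def cpi_frank_wolfe_step(pi, q_values, alpha):
    """One conservative policy iteration update, viewed as a Frank-Wolfe step.

    pi:       (S, A) array, current stochastic policy (each row sums to 1).
    q_values: (S, A) array, action values Q^pi (assumed given by an oracle).
    alpha:    mixing step size in (0, 1].
    """
    # Frank-Wolfe "linear oracle": the greedy (one-hot) policy w.r.t. Q^pi.
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), np.argmax(q_values, axis=1)] = 1.0
    # Conservative update: move only a fraction alpha toward the greedy policy.
    return (1.0 - alpha) * pi + alpha * greedy
```

With $\alpha = 1$ this reduces to a full policy iteration step; small $\alpha$ keeps the new policy close to the old one, which is what enables the monotone improvement guarantees associated with CPI.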