The Role of Environment Access in Agnostic Reinforcement Learning

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies agnostic model-free reinforcement learning (RL) in large state spaces, where the policy class Π may exclude the globally optimal policy and sample-efficient learning is impossible under standard online interaction alone. The authors first show that the problem remains statistically intractable even under stronger access mechanisms, such as local simulators or μ-reset models, establishing fundamental limitations across multiple environment interaction paradigms. Then, for Block MDPs, they introduce a novel policy emulator construction that enables sample-efficient learning without requiring an explicit value-function class. Their core contributions are twofold: (i) lower bounds delineating when agnostic policy learning is statistically intractable under diverse environment access models; and (ii) the first positive result establishing agnostic, model-free, sample-efficient learnability in Block MDPs, yielding a constructive algorithmic framework for RL under weak prior assumptions.

📝 Abstract
We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that:

1. Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access.

2. Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called $\mu$-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness.

3. On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all policies $\pi \in \Pi$. These values are approximated without any explicit value function class.
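The positive result above hinges on the policy emulator: once a small tabular MDP approximating the true value functions is in hand, every policy in a finite class $\Pi$ can be evaluated by dynamic programming alone, with no value-function class. Below is a minimal sketch of that downstream evaluation step only (the paper's emulator construction itself is not reproduced); the function names and the representation of policies as per-step action tables are illustrative assumptions, not from the paper:

```python
import numpy as np

def evaluate_policy(P, R, policy, horizon):
    """Finite-horizon value of a policy in a tabular (emulator) MDP.

    P: transition tensor, shape (S, A, S); R: rewards, shape (S, A);
    policy: int array of shape (horizon, S), an action per (step, state).
    Returns the value vector at step 0, one entry per start state.
    """
    S = R.shape[0]
    V = np.zeros(S)  # terminal values V_H = 0
    idx = np.arange(S)
    for h in reversed(range(horizon)):
        a = policy[h]                   # action chosen in each state at step h
        V = R[idx, a] + P[idx, a] @ V   # Bellman backup under this policy
    return V

def best_in_class(P, R, Pi, horizon, start_dist):
    """Return the policy in the finite class Pi with the largest value
    under the start-state distribution, plus that value."""
    values = [float(start_dist @ evaluate_policy(P, R, pi, horizon)) for pi in Pi]
    i = int(np.argmax(values))
    return Pi[i], values[i]
```

The point of the construction is that `P` and `R` here describe a small surrogate MDP built from samples, so the cost of this exhaustive evaluation scales with the emulator's state space, not the original environment's.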
Problem

Research questions and friction points this paper is trying to address.

Agnostic policy learning in large state spaces with function approximation
Statistical intractability with local simulator and reset distribution access
Tractability in Block MDPs with combined reset models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lower bounds for agnostic policy learning with local simulator access
Lower bounds under reset distributions with good coverage, including failure of PSDP and CPI without policy completeness
Policy emulator algorithm for Block MDPs