Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies policy evaluation in finite-horizon offline reinforcement learning, assuming the data satisfies the concentrability condition and the target policy's Q-function is linearly realizable ($q^π$-realizability). We establish, for the first time, that statistically efficient policy evaluation is achievable using only trajectory-level offline data. Methodologically, we introduce an analytical framework combining linear function approximation with explicit modeling of trajectory structure, enabling a tighter error-propagation analysis and significantly improving the sample complexity upper bound of an existing policy optimization algorithm. Our main contributions are: (1) proving the theoretical feasibility of efficient policy evaluation from trajectory data under weak coverage (concentrability) and linear realizability; and (2) reducing the sample complexity of policy optimization to the same order as policy evaluation, thereby breaking prior barriers that required stronger assumptions (e.g., uniform coverage or completeness) or substantially more data.
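For reference, the two standing assumptions take the following standard forms in this line of work (the feature map $\phi_h$, parameter $\theta^π_h$, data distribution $\mu_h$, occupancy measure $d^π_h$, and coverage constant $C$ are illustrative notation, not copied from the paper):

$$\text{linear } q^π\text{-realizability:}\qquad q^π_h(s,a) = \langle \phi_h(s,a),\, \theta^π_h \rangle \quad \text{for all } h,\, (s,a) \text{ and all policies } π,$$

$$\text{concentrability:}\qquad \sup_{h,\,(s,a)} \frac{d^π_h(s,a)}{\mu_h(s,a)} \le C < \infty,$$

i.e., every policy's stage-wise state-action value function lies in the span of known $d$-dimensional features, and the offline data distribution covers, up to the factor $C$, every state-action pair the target policy visits.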

📝 Abstract
We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.
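The distinguishing data assumption is that the offline dataset consists of complete episodes rather than isolated transition tuples; schematically (notation illustrative, not taken from the paper),

$$\mathcal{D} = \Big\{ \big(S^{(i)}_0, A^{(i)}_0, R^{(i)}_0, \ldots, S^{(i)}_{H-1}, A^{(i)}_{H-1}, R^{(i)}_{H-1}\big) \Big\}_{i=1}^{n},$$

with $H$ the horizon and $n$ the number of trajectories. Access to whole trajectories is precisely the extra structure, beyond concentrability and $q^π$-realizability, that circumvents the impossibility result of Foster et al. (2021).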
Problem

Research questions and friction points this paper is trying to address.

Statistically efficient policy evaluation from trajectory-level offline data
Circumventing the known impossibility result under concentrability and linear $q^π$-realizability alone
Improving the sample complexity of offline policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory data enables statistically efficient offline policy evaluation
Linear $q^π$-realizability plus concentrability suffices for statistical efficiency given trajectory data
Improved sample complexity via a tighter analysis of an existing policy optimization algorithm