🤖 AI Summary
This paper investigates the computational complexity of reinforcement learning under partial $q^\pi$-realizability: the difficulty of learning an $\varepsilon$-optimal policy when the value functions of all policies in a given policy class $\Pi$ admit linear representations. This setting lies strictly between classical $q^\pi$- and $q^*$-realizability, and the paper is the first to characterize computational bottlenecks when *non-optimal* policies also satisfy linear representability. The analysis proceeds via reductions from $\delta$-Max-3SAT and its bounded variant $\delta$-Max-3SAT(b). Under both parameterized greedy (GLinear-$\kappa$-RL) and softmax (SLinear-$\kappa$-RL) policy parametrizations, we prove that learning an $\varepsilon$-optimal policy is NP-hard. Furthermore, assuming the randomized exponential-time hypothesis (rETH), we establish a $2^{\Omega(\kappa)}$ time lower bound for the softmax case, exposing an inherent hardness of forward algorithms in the generative model setting.
📝 Abstract
This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^π$-realizability. In this framework, the objective is to learn an $ε$-optimal policy with respect to a predefined policy set $Π$, under the assumption that all value functions for policies in $Π$ are linearly realizable. The assumptions of this framework are weaker than those in $q^π$-realizability but stronger than those in $q^*$-realizability, providing a practical model where function approximation naturally arises. We prove that learning an $ε$-optimal policy in this setting is computationally hard. Specifically, we establish NP-hardness under a parameterized greedy policy set (argmax) and show that, unless NP = RP, an exponential lower bound (in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those in $q^*$-realizability and suggest computational difficulty persists even when $Π$ is expanded beyond the optimal policy. To establish this, we reduce from two complexity problems, $δ$-Max-3SAT and $δ$-Max-3SAT(b), to instances of GLinear-$κ$-RL (greedy policy) and SLinear-$κ$-RL (softmax policy). Our findings indicate that positive computational results are generally unattainable in partial $q^π$-realizability, in contrast to $q^π$-realizability under a generative access model.
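To make the two policy classes concrete, here is a minimal sketch of greedy (argmax) and softmax policies built from linear action values. It is an illustration only, not the paper's construction or reduction: the function names, the toy feature vectors `phi_s`, the weight vector `theta`, and the unit softmax temperature are all assumptions introduced here.

```python
import math

def linear_q(features, theta):
    """Linear action values: q(s, a) = <phi(s, a), theta> for each action."""
    return [sum(f * t for f, t in zip(phi, theta)) for phi in features]

def greedy_policy(features, theta):
    """Argmax policy; ties are broken toward the lowest action index."""
    q = linear_q(features, theta)
    return max(range(len(q)), key=lambda a: q[a])

def softmax_policy(features, theta):
    """Softmax policy (temperature 1 assumed): action probabilities ~ exp(q)."""
    q = linear_q(features, theta)
    m = max(q)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in q]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy state: three actions with 2-dimensional features.
phi_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.25]

greedy_action = greedy_policy(phi_s, theta)
action_probs = softmax_policy(phi_s, theta)
```

In this toy state the greedy policy commits to a single action, while the softmax policy spreads probability over all three. The realizability assumption described above concerns the value function of each such policy in $Π$ being linear in the features, which this sketch does not verify.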