🤖 AI Summary
Existing quantum reinforcement learning (QRL) approaches approximate stochastic environments indirectly by estimating expected outcomes rather than modeling their underlying random variables, which limits expressiveness and adaptability. This work proposes QnRL, a quantum-native distributional reinforcement learning framework that uses superposed and entangled quantum states to learn conditional action policy distributions directly in Hilbert space. It introduces a new algorithm, quantum amplitude kickback (QuAK), which compares the $n$-th power of the $m$-th moment of multiple superposed distributions. The authors prove that QuAK distills a conditional action policy distribution from the moments of a quantum generative model entirely within Hilbert space, and that this composition can express environment correlations unavailable to purely classical and classically-sampled quantum distributional models. Experiments across diverse environments report up to 82.9% higher evaluation scores with up to 94.3% fewer parameters on average, along with more accurate expected-return estimates for unseen observations and better adaptation to varying stochastic conditions compared to the baseline.
📝 Abstract
Quantum reinforcement learning (QRL) is a promising approach for learning effective decision strategies in applications with stochastic environments. Instead of directly modeling the random variables that govern these environments, existing QRL architectures approximate environment behavior indirectly by estimating expected outcomes, which limits their expressive power and adaptive potential. Overcoming this limitation requires a QRL approach that exploits the distributional nature of quantum computers to model environment random variables directly as quantum state distributions. Hence, in this paper, a novel framework dubbed quantum-native reinforcement learning (QnRL) is proposed. QnRL is a distributional RL framework that learns conditional distributions natively in Hilbert space via superposed and entangled quantum states. Thus, QnRL can directly model the behavior of stochastic learning environments via the natural properties of quantum systems. QnRL accomplishes this via a proposed quantum amplitude kickback (QuAK) algorithm that enables comparing the $n$-th power of the $m$-th moment of multiple superposed distributions. It is theoretically proven that a conditional action policy distribution is distilled from the moments of a quantum generative model entirely within Hilbert space via QuAK, and optimized via QnRL. This complex distribution composition is also shown to provide extra dimensions for expressing environment correlations that are unavailable to purely classical and classically-sampled quantum distributional models. Experimental results across diverse environments show that, compared to the baseline, QnRL achieves up to $82.9\%$ higher evaluation scores with up to $94.3\%$ fewer parameters on average, more accurately estimates the expected return for unseen observations, and better adapts to varying stochastic conditions.
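As a rough classical illustration of why comparing moments carries more information than comparing expected outcomes alone, the following NumPy sketch treats a state vector's squared amplitudes as a return distribution (the Born rule) and evaluates the $n$-th power of the $m$-th moment for two candidate actions. It is not the QuAK algorithm, which the abstract describes as operating entirely within Hilbert space; the function names, return values, and amplitude vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np

def born_distribution(amplitudes):
    """Normalize an amplitude vector and return Born-rule probabilities |a_i|^2."""
    a = np.asarray(amplitudes, dtype=complex)
    a = a / np.linalg.norm(a)
    return np.abs(a) ** 2

def moment_power(probs, values, m, n=1):
    """Return (E[X^m])^n for a discrete distribution over `values`."""
    raw_moment = float(np.dot(probs, np.asarray(values, dtype=float) ** m))
    return raw_moment ** n

# Hypothetical return values attached to four basis states.
returns = np.array([-1.0, 0.0, 1.0, 2.0])

# Two hypothetical actions whose return distributions share the same mean.
p_a = born_distribution([0, 1, 1, 0])  # mass on returns 0 and 1
p_b = born_distribution([1, 0, 0, 1])  # mass on returns -1 and 2

mean_a = moment_power(p_a, returns, m=1)
mean_b = moment_power(p_b, returns, m=1)
second_a = moment_power(p_a, returns, m=2)
second_b = moment_power(p_b, returns, m=2)

print(f"first moments:  A={mean_a:.2f}, B={mean_b:.2f}")
print(f"second moments: A={second_a:.2f}, B={second_b:.2f}")
```

Under these assumed values, both actions have the same expected return, so an expectation-only estimator cannot tell them apart, while their second moments differ. This is the kind of distinction a distributional method is meant to capture.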