🤖 AI Summary
This work addresses the computational and memory challenges posed by infinite-dimensional function estimation in reinforcement learning with continuous state spaces by proposing Q-Measure-Learning, a novel approach that reformulates Q-function learning as a measure learning problem. The method constructs signed empirical measures over visited state-action pairs and reconstructs the action-value function via kernel integration. By coupling the stationary distribution with the Q-measure through stochastic approximation, it achieves linear memory and computational complexity using a single online trajectory. Under a uniform ergodicity assumption, the algorithm is shown to converge almost surely and uniformly, with theoretical analysis quantifying the influence of kernel bandwidth on Q-function approximation error. Empirical validation on a two-product inventory control task demonstrates the effectiveness of the proposed approach.
📄 Abstract
We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose Q-Measure-Learning, a novel method that learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-product inventory control setting.
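To make the weight-based idea concrete, the sketch below illustrates the data structures the abstract describes: one signed weight per visited state-action pair, with $Q(s,a)$ reconstructed by kernel integration against those weights. This is a minimal, hypothetical illustration only; the kernel choice, learning rate, and update rule here are assumptions, and the paper's coupled estimation of the stationary distribution is not reproduced.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel on concatenated (state, action) vectors (an assumed choice)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

class QMeasureSketch:
    """Illustrative signed-measure representation of a Q-function estimate."""

    def __init__(self, bandwidth=0.5, gamma=0.9, lr=0.1):
        self.points = []    # visited (state, action) pairs, one per step -> O(n) memory
        self.weights = []   # signed weights of the empirical Q-measure
        self.bandwidth = bandwidth
        self.gamma = gamma  # discount factor
        self.lr = lr        # stochastic-approximation step size (assumed constant)

    def q_value(self, state, action):
        # Kernel integration: Q(s,a) = sum_i w_i * K((s,a), (s_i,a_i)) -> O(n) per query.
        z = np.concatenate([state, action])
        return sum(w * gaussian_kernel(z, p, self.bandwidth)
                   for w, p in zip(self.points and self.weights, self.points))

    def update(self, state, action, reward, next_state, next_actions):
        # One online step: move toward a kernel-smoothed Bellman target by
        # appending a new signed atom (hypothetical TD-style rule, not the
        # paper's exact coupled update).
        target = reward + self.gamma * max(
            self.q_value(next_state, a) for a in next_actions)
        td_error = target - self.q_value(state, action)
        self.points.append(np.concatenate([state, action]))
        self.weights.append(self.lr * td_error)
```

Each trajectory step appends exactly one atom, so after $n$ steps the measure has $n$ support points and both storage and a single Q-evaluation scale linearly in $n$, matching the $O(n)$ costs stated above.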