Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning

📅 2025-10-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
To address policy degradation caused by abrupt distribution shifts in non-stationary reinforcement learning, this paper proposes DQUCB, a Q-learning framework that detects and adapts to changes in the transition distribution. DQUCB models the state-action transition density via non-parametric density estimation, quantifies the likelihood of a distribution shift, and feeds this measure into an upper-confidence-bound (UCB) exploration bonus, so that change detection and the exploration-exploitation trade-off are handled jointly. Theoretically, the oracle version of DQUCB achieves a better regret guarantee than QUCB in both finite-horizon and infinite-horizon MDPs. Empirically, it is evaluated on RL tasks and on a real-world application, hospital allocation for COVID-19 patients, where it attains lower regret than QUCB baselines while retaining the computational efficiency of model-free RL.

📝 Abstract
We study non-stationary reinforcement learning (RL) under distribution shifts in both finite-horizon episodic and infinite-horizon discounted Markov Decision Processes (MDPs). In the finite-horizon case, the transition functions may change abruptly at a particular episode. In the infinite-horizon setting, such changes can occur at an arbitrary time step during the agent's interaction with the environment. While the Q-learning Upper Confidence Bound algorithm (QUCB) can discover a proper policy during learning, once a distribution shift occurs this policy may keep exploiting actions whose rewards have become sub-optimal. To address this issue, we propose Density-QUCB (DQUCB), a shift-aware Q-learning UCB algorithm that uses a transition density function to detect distribution shifts and then leverages the likelihood under that density to improve the uncertainty estimates of Q-learning UCB, yielding a better balance between exploration and exploitation. Theoretically, we prove that our oracle DQUCB achieves a better regret guarantee than QUCB. Empirically, DQUCB retains the computational efficiency of model-free RL and attains lower regret than QUCB baselines across RL tasks, as well as on a real-world COVID-19 patient hospital allocation task using a deep Q-learning architecture.
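
To make the role of the UCB bonus concrete, the following is a minimal sketch of one Hoeffding-style Q-learning UCB update for a tabular finite-horizon MDP. It is not the paper's algorithm: the function name `qucb_update`, the constants `c` and `iota`, and especially the `bonus_scale` argument are assumptions introduced here for illustration. `bonus_scale` is a hypothetical hook showing where a shift signal could widen the exploration bonus; the abstract does not state how DQUCB actually combines the likelihood with the bonus.

```python
import math
import numpy as np


def qucb_update(Q, V, N, h, s, a, r, s_next, H, c=1.0, iota=1.0, bonus_scale=1.0):
    """Apply one Hoeffding-style Q-learning UCB update at step h of an episode.

    Q: array of shape (H, S, A), optimistically initialised to H.
    V: array of shape (H + 1, S) with V[H] fixed at 0.
    N: integer visit counts of shape (H, S, A).
    bonus_scale: hypothetical knob (not from the paper); values above 1
        widen the bonus, e.g. after a suspected distribution shift.
    """
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1) / (H + t)  # step size that favours recent samples
    bonus = bonus_scale * c * math.sqrt(H**3 * iota / t)  # shrinks as visits grow
    target = r + V[h + 1, s_next] + bonus
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
    V[h, s] = min(H, Q[h, s].max())  # value estimates are capped at H
    return Q[h, s, a]
```

With `bonus_scale=1.0` the update follows the standard optimistic form; a larger value simply makes rarely visited or recently suspect state-action pairs look more attractive, which is one plausible way to trigger re-exploration after a shift.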

Problem

Research questions and friction points this paper is trying to address.

Addresses non-stationary reinforcement learning with distribution shifts
Proposes shift-aware Q-learning to detect environmental changes dynamically
Improves regret guarantees for episodic and infinite-horizon MDPs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses density function to detect distribution shifts
Enhances uncertainty estimation in Q-learning UCB
Balances exploration and exploitation in non-stationary RL
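
The density-based detection idea in the points above can be illustrated with a small sketch. It assumes a tabular MDP and uses a Laplace-smoothed count estimate of P(s' | s, a) as a stand-in for the paper's density estimator, whose exact form is not given here. The class `TransitionDensity`, the function `shift_score`, and the `smoothing` parameter are hypothetical names chosen for this example.

```python
import numpy as np


class TransitionDensity:
    """Laplace-smoothed empirical estimate of P(s' | s, a) for a tabular MDP."""

    def __init__(self, n_states, n_actions, smoothing=1.0):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.smoothing = smoothing  # pseudo-count so unseen transitions keep nonzero mass

    def update(self, s, a, s_next):
        self.counts[s, a, s_next] += 1

    def prob(self, s, a, s_next):
        row = self.counts[s, a] + self.smoothing
        return row[s_next] / row.sum()


def shift_score(density, window):
    """Mean negative log-likelihood of recent (s, a, s') triples.

    A higher score means the recent transitions are less likely under the
    density fitted so far, which is evidence of a distribution shift.
    """
    return -float(np.mean([np.log(density.prob(s, a, sn)) for s, a, sn in window]))
```

In use, an agent would call `update` on every observed transition and compare `shift_score` on a recent window against a threshold; how to set that threshold, and how the score enters the exploration bonus, are design choices left open by this sketch.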