Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
DQN's target Q-value updates often bootstrap from next states sampled under behavior policies inconsistent with the current policy, leading to high variance and distorted learning signals. To address this, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which introduces three key innovations: (i) explicit modeling of stochastic environment dynamics; (ii) expectation-based target Q-value estimation over the successor-state distribution, yielding low-variance, unbiased value updates; and (iii) improved policy consistency via successor-state prediction. SADQ integrates deep Q-networks with stochastic transition modeling and prioritized experience replay. Evaluated on standard RL benchmarks, including Atari and MuJoCo, and on real-world vector control tasks, SADQ consistently outperforms DQN and its major variants, achieving faster convergence, greater training stability, and stronger final performance.
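The core idea in (ii) can be illustrated with a minimal sketch. The function names, array shapes, and the discrete K-successor distribution below are assumptions for illustration, not the paper's implementation: a standard DQN target bootstraps from the one sampled next state, while an expectation-based target averages the bootstrapped value over a predicted successor-state distribution.

```python
import numpy as np

def dqn_target(reward, next_state_q, gamma=0.99):
    # Standard DQN target: bootstrap from the single sampled next state.
    return reward + gamma * np.max(next_state_q)

def sadq_style_target(reward, successor_probs, successor_qs, gamma=0.99):
    # Expectation-based target (sketch): average the bootstrapped value over
    # a predicted successor-state distribution instead of one sample.
    # successor_probs: (K,)  probabilities over K candidate successor states
    # successor_qs:    (K, A) Q-values for each candidate successor state
    values = successor_qs.max(axis=1)  # best action value per successor
    return reward + gamma * float(np.dot(successor_probs, values))

# Toy example: two equally likely successors with different best values.
probs = np.array([0.5, 0.5])
qs = np.array([[1.0, 0.0],
               [3.0, 2.0]])
print(dqn_target(0.0, qs[0]))             # one sampled successor: 0.99
print(sadq_style_target(0.0, probs, qs))  # expectation over both: 1.98
```

Averaging over the successor distribution is what removes the sampling noise of a single drawn next state from the target, which is the variance-reduction mechanism the summary describes.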

📝 Abstract
Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. However, the target updates in DQN often rely on next states generated by actions from past, potentially suboptimal, policies. As a result, these states may fail to provide informative learning signals, introducing high variance into the update process. This issue is exacerbated when the sampled transitions are poorly aligned with the agent's current policy. To address this limitation, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which explicitly models environment dynamics using a stochastic transition model. SADQ integrates successor-state distributions into the Q-value estimation process, enabling more stable and policy-aligned value updates. Additionally, it explores a more efficient action selection strategy using the modeled transition structure. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance. Our extensive empirical results across standard RL benchmarks and real-world vector-based control tasks demonstrate that SADQ consistently outperforms DQN variants in both stability and learning efficiency.
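The contrast between the two target constructions can be written compactly. The notation below is a paraphrase of the abstract (the paper's exact symbols may differ), with $\hat{P}$ denoting the learned stochastic transition model and $\theta^-$ the target-network parameters:

```latex
\underbrace{y_{\text{DQN}} = r + \gamma \max_{a'} Q_{\theta^-}(s', a')}_{\text{single sampled successor } s'}
\qquad
\underbrace{y_{\text{SADQ}} = r + \gamma\, \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s, a)}\!\left[\max_{a'} Q_{\theta^-}(s', a')\right]}_{\text{expectation over predicted successors}}
```

Since the expectation is taken under the modeled dynamics rather than over a single replayed sample, the target no longer depends on which next state the stale behavior policy happened to produce.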
Problem

Research questions and friction points this paper is trying to address.

DQN target updates use outdated states from suboptimal policies
This causes high variance and uninformative learning signals
SADQ models successor states for stable policy-aligned updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses successor-state distributions for Q-value estimation
Integrates stochastic transition model into DQN framework
Explores efficient action selection with transition structure
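The third bullet can be sketched as follows. The `q_net` and `transition_model` interfaces here are hypothetical toy stand-ins (the paper's model is learned); the point is that each action is scored by the expected value of its predicted successors rather than by a single Q-value lookup.

```python
import numpy as np

def q_net(state):
    # Toy Q-function: per-action values for a state encoded as an int.
    return np.array([float(state), float(state) / 2.0])

def transition_model(state, action):
    # Toy stochastic dynamics: (probabilities, candidate successor states).
    if action == 0:
        return np.array([1.0]), [state + 1]
    return np.array([0.5, 0.5]), [state, state + 3]

def model_based_action(state, actions, gamma=0.99):
    # Score each action by the expected bootstrapped value of its
    # predicted successors, then act greedily on those scores.
    scores = []
    for a in actions:
        probs, successors = transition_model(state, a)
        values = np.array([q_net(s).max() for s in successors])
        scores.append(gamma * float(np.dot(probs, values)))
    return actions[int(np.argmax(scores))]

print(model_based_action(1, [0, 1]))
```

In this toy setup, action 1's successors average to a higher value (0.99 · 2.5) than action 0's single successor (0.99 · 2.0), so the model-aware rule selects action 1.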
Lipeng Zu
Department of Computer Science, Florida State University, Tallahassee, FL 32306, USA
Hansong Zhou
Department of Computer Science, Florida State University, Tallahassee, FL 32306, USA
Xiaonan Zhang
Assistant Professor of Computer Science, Florida State University
Wireless communication and networking · Edge AI · Internet of Things