Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
In discounted, infinite-horizon reinforcement learning, adversarial corruption of reward signals (arising, for example, from extreme noise, sensor faults, or malicious attacks) can cause standard Q-learning to fail. To address this, we propose the first asynchronous Q-learning algorithm with finite-time robustness guarantees. Our method requires no prior knowledge of the reward statistics and converges to the optimal policy at a near-optimal rate even when a fraction of the observed rewards is arbitrarily corrupted. The analysis combines a corruption-robust update rule, a careful treatment of asynchronous sampling with time-correlated data, and a refined Azuma–Hoeffding inequality for almost-martingales, yielding the first convergence rate bound for asynchronous Q-learning under adversarial corruption: the rate matches the non-adversarial case up to an additive term proportional to the fraction of corrupted samples. We further establish an information-theoretic lower bound proving that this additive corruption term is unavoidable. Together, these results provide both a theoretical advance and a practical algorithmic framework for robust reinforcement learning.
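To make the algorithmic idea concrete, below is a minimal sketch of asynchronous Q-learning with a reward-clipping robustification step, run along a single time-correlated trajectory. The environment interface (env.reset, env.step, env.sample_action), the clipping level r_max, and the 1/N(s,a) step size are illustrative assumptions, not the paper's actual algorithm or schedule.

```python
# Minimal sketch: asynchronous Q-learning with clipped (possibly corrupted)
# rewards. The env interface and all hyperparameters are assumptions for
# illustration; the paper's algorithm and step-size schedule may differ.
import numpy as np

def robust_async_q_learning(env, n_states, n_actions,
                            gamma=0.9, T=100_000, r_max=1.0, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # per-pair update counts
    s = env.reset()  # hypothetical tabular-MDP interface
    for _ in range(T):
        # Epsilon-greedy behavior policy over the current Q estimate;
        # env.sample_action() (a uniformly random action) is hypothetical.
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # Robustification: clip the observed reward so any single corrupted
        # sample has bounded influence on the update. A bounded adversary can
        # still bias the estimate, which is why an additive corruption term
        # in the final bound is unavoidable.
        r = np.clip(r, -r_max, r_max)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]  # illustrative step size
        # Asynchronous update: only the visited (s, a) pair is touched.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next  # continue along the same time-correlated trajectory
    return Q
```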

📝 Abstract
We consider the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting where the reward signal is subject to adversarial corruption. Such corruption, which may arise from extreme noise, sensor faults, or malicious attacks, can severely degrade the performance of classical algorithms such as Q-learning. To address this challenge, we propose a new provably robust variant of the Q-learning algorithm that operates effectively even when a fraction of the observed rewards are arbitrarily perturbed by an adversary. Under the asynchronous sampling model with time-correlated data, we establish that despite adversarial corruption, the finite-time convergence rate of our algorithm matches that of existing results for the non-adversarial case, up to an additive term proportional to the fraction of corrupted samples. Moreover, we derive an information-theoretic lower bound revealing that the additive corruption term in our upper bounds is unavoidable. Next, we propose a variant of our algorithm that requires no prior knowledge of the statistics of the true reward distributions. The analysis of this setting is particularly challenging and is enabled by carefully exploiting a refined Azuma–Hoeffding inequality for almost-martingales, a technical tool that might be of independent interest. Collectively, our contributions provide the first finite-time robustness guarantees for asynchronous Q-learning, bridging a significant gap in robust RL.
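For context, the classical Azuma–Hoeffding inequality that the paper's refined tool builds on can be stated as follows; the almost-martingale variant used in the analysis (which presumably tolerates a small conditional bias of the kind adversarial corruption introduces) is not reproduced here. If \(X_1, \dots, X_n\) is a martingale difference sequence with respect to a filtration \((\mathcal{F}_t)\), i.e., \(\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0\), with \(|X_t| \le c_t\) almost surely, then

\[
\Pr\!\left( \left| \sum_{t=1}^{n} X_t \right| \ge \lambda \right) \le 2 \exp\!\left( -\frac{\lambda^2}{2 \sum_{t=1}^{n} c_t^2} \right) \quad \text{for all } \lambda > 0.
\]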
Problem

Research questions and friction points this paper is trying to address.

Learning the optimal policy when reward signals are adversarially corrupted
Designing a provably robust Q-learning variant for corrupted rewards
Handling asynchronous sampling with time-correlated data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous Q-learning that tolerates adversarial reward corruption
Near-optimal finite-time convergence rates despite corruption, with a matching information-theoretic lower bound
No prior knowledge of the reward statistics required (see the sketch below)
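Since the last point concerns operating without knowledge of the reward statistics, here is a generic robust-statistics sketch of how a clipping threshold could be chosen from data alone, using the median absolute deviation (MAD). This is an illustrative stand-in, not the adaptive mechanism analyzed in the paper.

```python
# Illustrative sketch: a data-driven clipping level via the median absolute
# deviation (MAD). The median and MAD resist a small fraction of corrupted
# samples, so the threshold tracks the scale of the true reward distribution.
# This is a generic stand-in, not the paper's mechanism.
import numpy as np

def mad_clip_threshold(rewards, k=3.0):
    r = np.asarray(rewards, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    # 1.4826 makes MAD a consistent estimate of the standard deviation
    # under Gaussian data; k sets how many "sigmas" of slack to allow.
    return med + k * 1.4826 * mad
```

A learner could, for instance, periodically recompute this threshold from the rewards observed so far and clip subsequent rewards to it, avoiding any fixed, distribution-dependent constant.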