Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the stability and convergence of two-timescale stochastic approximation algorithms under Markovian noise, moving beyond the conventional assumption of independent and identically distributed noise. By controlling the fast-timescale parameter with the running maximum of the slow-timescale parameter, rather than with its current value, the paper establishes stability and almost sure convergence without requiring projection operators or a compact noise space. Leveraging Markov chain noise modeling, martingale difference sequence analysis, and stability theory, the study provides the first proof of almost sure convergence for temporal difference learning with gradient correction (TDC) with eligibility traces under off-policy learning with linear function approximation. Because actor-critic methods are also two-timescale algorithms, the results support their analysis under the Markovian noise that arises in reinforcement learning.
📝 Abstract
This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal difference learning with gradient correction (TDC) and actor-critic methods. Previously, the stability (i.e., boundedness) and convergence of two-timescale SA were only established under i.i.d. noise. This work instead establishes the stability and convergence of two-timescale SA under Markovian noise, a setup that is more realistic in RL. Notably, we do not need to use any projection operator and the noise does not need to live in a compact space. Our key technical novelty is to control the fast timescale parameter with the running max of the slow timescale parameter, instead of with the current slow timescale parameter, as most prior works do. As a key application, we establish the first almost sure convergence of TDC with eligibility traces under off-policy learning with linear function approximation.
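To make the phrase "two-timescale SA under Markovian noise" concrete, here is a minimal sketch on a toy scalar problem. Everything in it is an illustrative assumption rather than the paper's setup: the function name `two_timescale_sa`, the two-state chain `P`, the noise values, the step-size exponents, and the linear update rules are all hypothetical. It does not implement TDC, eligibility traces, or the running-max argument used in the proof. It only shows a fast iterate `w` tracking a slow iterate `theta` while both are driven by the same Markov chain.

```python
import numpy as np


def two_timescale_sa(num_steps=100_000, seed=0):
    """Toy two-timescale SA driven by a two-state Markov chain (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Transition matrix of the noise chain; its stationary distribution is (2/3, 1/3).
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    # Noise value attached to each state; zero mean under the stationary
    # distribution: (2/3) * 1.0 + (1/3) * (-2.0) = 0.
    noise = np.array([1.0, -2.0])

    state = 0
    theta, w = 5.0, -3.0  # slow and fast parameters, arbitrary initial values

    for n in range(1, num_steps + 1):
        alpha = 1.0 / n          # slow step size
        beta = 1.0 / n ** 0.6    # fast step size, so alpha / beta -> 0
        y = noise[state]         # Markovian (not i.i.d.) noise sample

        # Fast timescale: for fixed theta, the mean dynamics pull w toward theta.
        w_next = w + beta * (theta - w + y)
        # Slow timescale: with w close to theta, the mean dynamics pull theta toward 0.
        theta = theta + alpha * (-w + y)
        w = w_next

        # Advance the Markov chain by one step.
        state = 1 if rng.random() < P[state, 1] else 0

    return theta, w


if __name__ == "__main__":
    theta, w = two_timescale_sa()
    print(f"theta = {theta:.4f}, w = {w:.4f}")
```

In this toy problem no projection is applied and the iterates are free to leave any bounded set, which mirrors the setting the abstract describes. The toy's linear structure makes stability easy, whereas the paper's contribution is proving it in general.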
Problem

Research questions and friction points this paper is trying to address.

two-timescale stochastic approximation
Markovian noise
convergence
reinforcement learning
stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

two-timescale stochastic approximation
Markovian noise
almost sure convergence
temporal difference learning with gradient correction (TDC)
off-policy reinforcement learning