From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of continuous-time dynamic modeling in deep reinforcement learning for continuous environments by formulating reinforcement learning as a continuous-time stochastic process. Integrating stochastic control theory, the authors develop an actor-critic framework that explicitly incorporates exploration and stochastic state transitions. They derive, for the first time, an infinitesimal evolution equation characterizing how the state distribution changes under policy gradient updates. Leveraging stochastic differential equations, two-timescale analysis, and the infinite-width limit theory of neural networks, they propose a nonparametric approach to analyze overparameterized neural actor-critic algorithms. The theoretical findings are validated on toy continuous-control tasks, revealing the dynamic interplay between evolving state distributions and value function estimates during training.
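For orientation, the objects named in the summary can be written in a standard textbook form. This is a sketch under assumptions, not the paper's own notation: the drift b, the constant scalar noise level σ, the discount rate β, the reward rate r, and the policy π_θ are all illustrative symbols, and the density equation uses the usual exploratory relaxation in which the drift is averaged over the policy.

```latex
% Controlled diffusion with a stochastic (exploratory) policy
dX_t = b(X_t, a_t)\,dt + \sigma\,dW_t, \qquad a_t \sim \pi_\theta(\cdot \mid X_t)

% Cumulative discounted return estimated by the critic
V^{\pi_\theta}(x) = \mathbb{E}\!\left[\int_0^\infty e^{-\beta t}\, r(X_t, a_t)\,dt \;\middle|\; X_0 = x\right]

% Fokker-Planck equation for the state density \rho_t in environment time t,
% with policy-averaged drift \bar b_\theta(x) = \int b(x, a)\,\pi_\theta(da \mid x)
\partial_t \rho_t(x) = -\nabla \cdot \big(\bar b_\theta(x)\,\rho_t(x)\big) + \frac{\sigma^2}{2}\,\Delta \rho_t(x)
```

In this picture, a policy gradient update changes θ and therefore the averaged drift, which in turn changes how the state density evolves. The paper's evolution equation describes that dependence on gradient time; the equations above only fix the environment-time side.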
📝 Abstract
We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a process evolving on two timescales: environment time and gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.
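To make the "environment time" side of the abstract concrete, here is a minimal sketch of a continuous-time rollout discretized with the Euler-Maruyama scheme. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the one-dimensional dynamics, the linear-Gaussian policy with gain `theta`, the quadratic reward rate, and all parameter names and default values.

```python
import math
import random


def simulate_discounted_return(theta, x0=1.0, dt=0.01, horizon=5.0,
                               sigma_env=0.2, sigma_explore=0.3,
                               beta=1.0, seed=0):
    """Euler-Maruyama rollout of a hypothetical 1-D controlled diffusion.

    Assumed toy model (illustrative only):
        dX_t = a_t dt + sigma_env dW_t
        a_t  = -theta * X_t + sigma_explore * noise   (exploratory policy)
        reward rate r(x, a) = -(x^2 + 0.1 * a^2), discount rate beta
    Returns the discounted return of one trajectory and the state path.
    """
    rng = random.Random(seed)
    x = x0
    discounted_return = 0.0
    n_steps = int(round(horizon / dt))
    path = [x]
    for k in range(n_steps):
        t = k * dt
        # Exploration: the action is resampled at every discretization step.
        # This is a simplification of exploration in continuous time.
        a = -theta * x + sigma_explore * rng.gauss(0.0, 1.0)
        # Left-endpoint Riemann sum for the discounted reward integral.
        discounted_return += math.exp(-beta * t) * (-(x * x + 0.1 * a * a)) * dt
        # Euler-Maruyama step: drift * dt plus diffusion * sqrt(dt) * N(0, 1).
        x += a * dt + sigma_env * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return discounted_return, path


if __name__ == "__main__":
    ret, path = simulate_discounted_return(theta=1.0)
    print(f"discounted return: {ret:.4f}, final state: {path[-1]:.4f}")
```

In the two-timescale view, an actor-critic loop would wrap rollouts like this one: each gradient step perturbs `theta` slightly, and the distribution of the resulting paths shifts with it. The sketch covers only a single rollout at fixed `theta`.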
Problem

Research questions and friction points this paper is trying to address.

continuous reinforcement learning
stochastic differential equations
actor-critic algorithms
overparametrized neural networks
continuous-time dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous-time reinforcement learning
actor-critic algorithms
stochastic differential equations
two-timescale dynamics
infinite-width limit