A Temporal Difference Method for Stochastic Continuous Dynamics

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Hamilton–Jacobi–Bellman (HJB)-driven reinforcement learning methods require full knowledge of the system dynamics, rendering them inapplicable to model-free settings with stochastic continuous-time systems. This paper proposes the first fully model-free, HJB-guided temporal-difference (TD) framework, which directly targets the HJB partial differential equation for stochastic differential equation (SDE) systems without accessing the drift or diffusion coefficients. Leveraging Itô’s lemma and stochastic calculus, the method establishes a model-free gradient update rule and comes with rigorous convergence guarantees. It unifies stochastic optimal control with model-free RL, removing the long-standing requirement in HJB-based RL for exact knowledge of the dynamics. Experiments demonstrate substantial improvements in policy performance and sample efficiency across multiple continuous-control benchmarks, with greater stability and computational efficiency than transition-kernel-based approaches.
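For reference, the continuous-time setting the summary refers to can be written as follows. This is the standard discounted stochastic optimal control formulation in my own notation, not an excerpt from the paper; its drift and diffusion terms are exactly what a model-free method must avoid querying.

```latex
% Controlled SDE (standard formulation, notation mine):
%   dX_t = b(X_t, a_t)\,dt + \sigma(X_t, a_t)\,dW_t
% Discounted HJB equation for the optimal value function V, discount rate \rho:
\[
\rho V(x) \;=\; \max_{a} \Big[\, r(x, a)
  + b(x, a)^{\top} \nabla V(x)
  + \tfrac{1}{2} \operatorname{Tr}\!\big( \sigma(x, a) \sigma(x, a)^{\top} \nabla^{2} V(x) \big) \Big].
\]
% Classical HJB-based RL updates evaluate b and \sigma explicitly; the paper's
% point is to estimate this residual from samples instead.
```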

📝 Abstract
For continuous systems modeled by dynamical equations such as ODEs and SDEs, Bellman's principle of optimality takes the form of the Hamilton-Jacobi-Bellman (HJB) equation, which provides the theoretical target of reinforcement learning (RL). Although recent advances in RL successfully leverage this formulation, existing methods typically assume the underlying dynamics are known a priori, because they need explicit access to the coefficient functions of the dynamical equations to update the value function following the HJB equation. We address this inherent limitation of HJB-based RL: we propose a model-free approach that still targets the HJB equation, together with the corresponding temporal difference method. We demonstrate its potential advantages over transition kernel-based formulations, both qualitatively and empirically. The proposed formulation paves the way toward bridging stochastic optimal control and model-free reinforcement learning.
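To make the idea concrete, here is a minimal sketch of a model-free TD(0) update on small-step samples from an unknown SDE. This illustrates the general recipe, not the paper's algorithm; the feature map `phi`, the learning rate, the discount rate `rho`, and the Ornstein-Uhlenbeck toy simulator are all my assumptions.

```python
import numpy as np

def phi(x):
    """Hypothetical polynomial feature map for a scalar state (my choice)."""
    return np.array([1.0, x, x**2])

def td_update(theta, x, r, x_next, dt, rho=0.1, lr=1e-2):
    """One model-free TD(0) step on a sampled transition (x, r, x_next).

    The SDE's drift and diffusion never appear here: their effect enters
    only through the sampled increment x_next - x.
    """
    gamma = np.exp(-rho * dt)  # continuous-time discount over a step of length dt
    delta = r * dt + gamma * theta @ phi(x_next) - theta @ phi(x)  # TD residual
    return theta + lr * delta * phi(x)  # semi-gradient TD(0) update

# Usage: stream transitions from an environment whose SDE coefficients are
# unknown to the learner (a toy Ornstein-Uhlenbeck simulator stands in here).
rng = np.random.default_rng(0)
theta, x, dt = np.zeros(3), 0.0, 1e-2
for _ in range(10_000):
    x_next = x - 0.5 * x * dt + 0.2 * np.sqrt(dt) * rng.standard_normal()
    theta = td_update(theta, x, r=-x**2, x_next=x_next, dt=dt)
    x = x_next
```

Note that `td_update` consumes only (state, reward, next state) tuples; the drift and diffusion influence learning solely through the sampled increments.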
Problem

Research questions and friction points this paper is trying to address.

- Model-free RL for continuous dynamics without known coefficient functions
- A temporal-difference method targeting the HJB equation
- Bridging stochastic optimal control and model-free RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Model-free approach targeting the HJB equation (see the derivation sketch after this list)
- Temporal-difference method for continuous dynamics
- Bridges stochastic optimal control and reinforcement learning
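As referenced above, here is why sampled increments suffice, sketched with standard stochastic calculus in my own notation (the paper's actual derivation may differ): Itô's lemma shows that the expected one-step change of the value function already contains the generator of the SDE, so the HJB residual can be estimated without ever evaluating the drift b or diffusion σ.

```latex
% By It\^o's lemma, for dX_t = b\,dt + \sigma\,dW_t and smooth V:
%   dV(X_t) = \mathcal{L}V(X_t)\,dt + \nabla V(X_t)^{\top} \sigma\,dW_t,
% where \mathcal{L}V = b^{\top}\nabla V
%   + \tfrac{1}{2}\operatorname{Tr}(\sigma\sigma^{\top}\nabla^{2}V).
% The dW_t term is zero-mean, so the expected one-step TD residual recovers
% the HJB residual without explicit access to b or \sigma:
\[
\mathbb{E}\big[\, r(X_t, a_t)\,dt + dV(X_t) - \rho V(X_t)\,dt \,\big]
  \;=\; \big( r + \mathcal{L} V - \rho V \big)(X_t)\,dt .
\]
```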
Authors

Haruki Settai (University of Tokyo)
Naoya Takeishi (University of Tokyo; machine learning, dynamical systems)
Takehisa Yairi (University of Tokyo)