Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of learning equilibrium strategies in general time-inconsistent control problems by proposing a unified two-stage framework. In the first stage, for given auxiliary functions, an auxiliary time-consistent control problem is solved with the deterministic policy gradient method. In the second stage, the auxiliary functions are updated for the current policy through inner fixed-point iterations characterized via martingale methods, and the two stages are repeated so that the policy progressively approximates the equilibrium strategy. This approach represents the first extension of deterministic policy gradients to continuous-time, model-free settings for time-inconsistent control, establishes convergence of the inner fixed-point iterations under mild model assumptions, and integrates the extended Hamilton–Jacobi–Bellman system within an actor–critic architecture. Numerical experiments on two financial applications, mean–variance portfolio selection and optimal tracking under non-exponential discounting, illustrate the algorithm's effectiveness.
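To make the deterministic policy gradient ingredient concrete, here is a minimal sketch of the chain-rule update on a toy one-step problem. It is not the paper's continuous-time, model-free algorithm: the linear policy `mu`, the known quadratic critic `Q(s, a) = -(a - 2s)^2`, the state distribution, and the learning rate are all assumptions chosen for illustration only.

```python
import numpy as np

def mu(theta, s):
    # Hypothetical deterministic policy: action is linear in the state.
    return theta * s

def dq_da(s, a):
    # Gradient in the action of the assumed critic Q(s, a) = -(a - 2 s)^2.
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr):
    # Deterministic policy gradient via the chain rule:
    # dJ/dtheta = E[ dmu/dtheta * dQ/da at a = mu(s) ], with dmu/dtheta = s.
    actions = mu(theta, states)
    grad = np.mean(states * dq_da(states, actions))
    return theta + lr * grad  # gradient ascent on the expected critic value

rng = np.random.default_rng(0)
theta = 0.0
for _ in range(200):
    states = rng.uniform(0.5, 1.5, size=64)  # assumed state distribution
    theta = dpg_step(theta, states, lr=0.1)

print(theta)  # approaches 2.0, the maximizer of the assumed critic
```

In this toy setting the critic is given in closed form, so only the actor is updated. In the framework summarized above, the critic-like auxiliary functions would themselves be learned in the second stage.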

📝 Abstract

In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem into an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit the inner fixed-point iterations and some martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we provide some mild model assumptions and establish the convergence of the inner fixed-point iterations. By repeating this actor-critic style of iterations across two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The superior effectiveness of the proposed algorithm is illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.
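As a small illustration of why non-exponential discounting produces time-inconsistency, the sketch below compares an exponential and a hyperbolic discount function on a choice between a smaller-sooner and a larger-later reward. The reward sizes, payment dates, and discount parameters are assumptions picked for the example and do not come from the paper.

```python
import math

def exponential(delay, rate=0.1):
    # Exponential discounting: the ranking of two dated rewards never changes.
    return math.exp(-rate * delay)

def hyperbolic(delay, k=1.0):
    # Hyperbolic discounting, a standard non-exponential example.
    return 1.0 / (1.0 + k * delay)

def prefers_later(discount, now, t_small=5.0, r_small=1.0,
                  t_large=6.0, r_large=1.5):
    # Compare the two rewards as valued by the agent standing at time `now`.
    v_small = r_small * discount(t_small - now)
    v_large = r_large * discount(t_large - now)
    return v_large > v_small

for name, d in [("exponential", exponential), ("hyperbolic", hyperbolic)]:
    print(name, prefers_later(d, now=0.0), prefers_later(d, now=5.0))
```

Under the hyperbolic discount the agent at time 0 plans to wait for the larger reward, but the same agent at time 5 switches to the smaller one. A plan that is optimal today is abandoned later, which is why such problems are posed in terms of equilibrium policies instead of globally optimal ones.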

Problem

Research questions and friction points this paper is trying to address.

time-inconsistent control

deterministic policy

equilibrium learning

reinforcement learning

continuous-time

Innovation

Methods, ideas, or system contributions that make the work stand out.

deterministic policy gradient

time-inconsistent control

extended Hamilton-Jacobi-Bellman