🤖 AI Summary
This work addresses the challenge of learning equilibrium strategies in general time-inconsistent control problems by proposing a unified two-stage framework. In the first stage, for given auxiliary functions, an auxiliary time-consistent control problem is solved using deterministic policy gradients. In the second stage, the auxiliary functions are updated through inner fixed-point iterations based on martingale characterizations, so that alternating between the two stages progressively approximates the equilibrium strategy. The approach represents the first extension of deterministic policy gradients to continuous-time, model-free settings for time-inconsistent control, establishes convergence of the inner iteration under mild model assumptions, and integrates the extended Hamilton–Jacobi–Bellman system within an actor–critic architecture. Empirical evaluations on two financial applications, mean–variance portfolio selection and optimal portfolio tracking under non-exponential discounting, illustrate the algorithm’s effectiveness.
📝 Abstract
In this paper, we develop a continuous-time, model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem as an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit inner fixed-point iterations and martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we establish the convergence of the inner fixed-point iterations under mild model assumptions. By repeating this actor-critic style of iteration across the two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The effectiveness of the proposed algorithm is illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.
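To make the notion of time-inconsistency concrete, the following minimal sketch shows the preference reversal that non-exponential discounting can induce. It is not the paper's algorithm or model. The hyperbolic discount function, the reward sizes and dates, and all function names are illustrative assumptions chosen only to exhibit the effect.

```python
import math


def hyperbolic(delay, k=1.0):
    """Hyperbolic discount factor 1 / (1 + k * delay) (assumed, non-exponential)."""
    return 1.0 / (1.0 + k * delay)


def exponential(delay, rho=0.5):
    """Exponential discount factor exp(-rho * delay) (time-consistent benchmark)."""
    return math.exp(-rho * delay)


def prefers_later(discount, now, t_small=5.0, r_small=1.0, t_large=6.0, r_large=1.5):
    """Return True if, evaluated at time `now`, the larger-later reward
    has a higher discounted value than the smaller-sooner reward.

    The reward sizes and dates are hypothetical illustration values.
    """
    v_small = r_small * discount(t_small - now)
    v_large = r_large * discount(t_large - now)
    return v_large > v_small


if __name__ == "__main__":
    for name, d in [("hyperbolic", hyperbolic), ("exponential", exponential)]:
        early = prefers_later(d, now=0.0)  # plan made far in advance
        late = prefers_later(d, now=5.0)   # re-evaluated when the first reward is due
        print(f"{name}: prefers later reward at t=0: {early}, at t=5: {late}")
```

Under the hyperbolic discount, the agent planning at time 0 prefers the larger, later reward but switches to the smaller, sooner one when re-evaluating at time 5, so the original plan is not followed. Under the exponential discount, the ranking depends only on the fixed gap between the two dates and therefore never changes. This reversal is why an optimal plan is not well defined in such problems and an equilibrium policy is sought instead.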