🤖 AI Summary
This work addresses dynamic medical treatment, which requires jointly optimizing treatment intensity and the timing of clinical interactions. Existing approaches often rely on fixed interaction intervals or enforce safety only at discrete time points, so they fail to account for continuous state evolution and the risks that arise between interactions. The authors formulate the problem as an option-based semi-Markov decision process with trajectory-level safety constraints, where each option comprises a continuous-time treatment policy and its duration. Key contributions include a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full trajectory with high probability, finite-sample guarantees for policy learning from logged treatment trajectories, and a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant (fixed-interval) interaction schemes across different safe policy optimization methods.
📝 Abstract
Dynamic medical treatment requires deciding treatment intensity and intervention timing, while patient states evolve continuously and adverse events may occur between clinical interactions. Most existing treatment learning methods assume fixed schedules or enforce safety only at discrete decision points. We propose Interaction-Limited Safe Continuous-Time Reinforcement Learning, a framework that jointly optimizes treatment administration and clinical interaction timing under trajectory-level safety constraints. Our key idea is to reformulate the continuous-time treatment problem as an option-based semi-Markov decision process, where each option specifies a continuous-time treatment policy and its duration. We develop a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full continuous-time trajectory with high probability. We further establish finite-sample guarantees for policy learning from logged treatment trajectories and introduce a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant interaction schemes across different safe policy optimization methods.
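
To make the option and safety-tightening ideas concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a scalar patient state whose rate of change is bounded by a known constant (`max_rate`), which replaces the paper's high-probability argument with a simpler deterministic one. All names here (`Option`, `tightened_threshold`, `longest_safe_duration`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Option:
    """One option: a dosing rule applied continuously, plus how long it runs
    before the next clinical interaction. (Hypothetical structure.)"""
    dose_rule: Callable[[float, float], float]  # (elapsed time, state) -> dose rate
    duration: float


def tightened_threshold(limit: float, max_rate: float, duration: float) -> float:
    """Largest state value allowed at an interaction time so that the state
    stays at or below `limit` for the whole option.

    Assumption (ours, not the paper's): |d(state)/dt| <= max_rate, so the
    state can rise by at most max_rate * duration before the next interaction.
    """
    return limit - max_rate * duration


def longest_safe_duration(
    state: float, limit: float, max_rate: float, candidates: Sequence[float]
) -> Optional[float]:
    """Pick the longest candidate duration whose tightened constraint holds
    at the current interaction time, or None if no candidate is safe."""
    safe = [d for d in candidates
            if state <= tightened_threshold(limit, max_rate, d)]
    return max(safe) if safe else None


# Example: the closer the state is to the limit, the shorter the safe interval.
constant_dose = Option(dose_rule=lambda t, x: 1.0, duration=4.0)
print(tightened_threshold(10.0, 0.5, constant_dose.duration))
print(longest_safe_duration(7.0, 10.0, 0.5, [1.0, 2.0, 4.0, 8.0]))
```

The sketch shows why interaction timing and safety are coupled: a longer option duration requires a larger margin at the interaction time, so a state near the limit forces earlier re-evaluation, while a state far from it permits longer intervals between interactions.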