🤖 AI Summary
This work addresses state distribution shift and cross-stage error propagation in policy optimization for cyclic Markov decision processes (Cyclic MDPs) by proposing CycleFQI, a modular offline reinforcement learning framework. CycleFQI decomposes the Cyclic MDP into stage-wise subproblems and introduces a vector of stage-specific Q-functions to model both intra-stage dynamics and inter-stage transitions, enabling partial-stage optimization and hybrid control. The paper establishes the first provably efficient offline learning theory for structured Cyclic MDPs, mitigating the curse of dimensionality under Besov regularity assumptions and achieving improved sample complexity and convergence rates over monolithic approaches. It further provides sieve-based asymptotic inference for the optimal policy value. Empirical evaluations on synthetic data and a real-world type 1 diabetes dataset demonstrate CycleFQI's effectiveness and corroborate its global convergence.
📝 Abstract
We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing the policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While the principle applies more broadly, we instantiate it as CycleFQI, an extension of fitted Q-iteration that admits theoretical analysis and interpretation. CycleFQI uses a vector of stage-specific Q-functions to capture both within-stage dynamics and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, demonstrating that CycleFQI mitigates the curse of dimensionality compared to monolithic baselines. Additionally, we propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world type 1 diabetes datasets demonstrate CycleFQI's effectiveness.
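To make the stage-wise decomposition concrete, here is a minimal tabular sketch of the cyclic Bellman backup behind a CycleFQI-style iteration. All specifics (three stages, toy state/action sizes, random dynamics, the particular discount factors) are illustrative assumptions, not the paper's setup: each stage keeps its own Q-table, reward, and discount, and the backup at stage k bootstraps from the greedy value at the cyclically next stage.

```python
import numpy as np

# Hypothetical toy cyclic MDP: K stages, each with its own dynamics,
# rewards, and discount factor (values chosen for illustration only).
rng = np.random.default_rng(0)
K, S, A = 3, 4, 2                      # stages, states per stage, actions

# P[k][s, a, s'] transitions from stage k into stage (k+1) mod K
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(K)]
R = [rng.uniform(0.0, 1.0, size=(S, A)) for _ in range(K)]   # stage rewards
gammas = [0.9, 0.8, 0.95]                                    # per-stage discounts

# One Q-function per stage, updated by the cyclic Bellman backup:
# Q_k(s, a) = r_k(s, a) + gamma_k * E[ max_{a'} Q_{(k+1) mod K}(s', a') ]
Q = [np.zeros((S, A)) for _ in range(K)]
for _ in range(500):
    for k in range(K):
        nxt = (k + 1) % K              # cyclic hand-off to the next stage
        V_next = Q[nxt].max(axis=1)    # greedy value at the successor stage
        Q[k] = R[k] + gammas[k] * (P[k] @ V_next)

# Greedy stage-wise policy: one decision rule per stage of the cycle
policy = [Q[k].argmax(axis=1) for k in range(K)]
```

In the actual algorithm each tabular backup is replaced by a regression fit (fitted Q-iteration) on offline data, and fixing some `Q[k]` updates to a predefined behavior policy yields the partial-control setting described above.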