🤖 AI Summary
This study addresses critical challenges in population health management for Medicaid beneficiaries—specifically, ensuring that intervention recommendations are safe, fair, and auditable. We propose an offline reinforcement learning framework that **decouples risk calibration from preference optimization**: a **conformal risk gating mechanism** filters out unsafe actions under a prespecified risk threshold, and, combined with a lightweight risk model and conservative policy learning, enables fine-grained fairness auditing. The method integrates conformal prediction, fitted Q-evaluation (FQE), and subgroup performance analysis to build a scalable, interpretable decision system. Evaluated on real-world Medicaid data, the approach achieves a risk discrimination AUC of ≈ 0.81 and a calibrated risk threshold τ ≈ 0.038 while maintaining high safe coverage, and it reveals statistically significant disparities in estimated policy value across demographic subgroups. To our knowledge, this is the first offline RL approach for high-stakes healthcare decisions that jointly addresses safety, fairness, and auditability.
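The gating mechanism above can be sketched with split-conformal calibration: choose the smallest threshold τ covering a 1 − α fraction of calibration risk scores (with the usual finite-sample correction), then mask any candidate action whose predicted adverse-event risk exceeds τ. This is a minimal illustration of the general technique, not the paper's implementation; the function names and score convention are assumptions.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.10):
    """Split-conformal threshold: the finite-sample-corrected
    (1 - alpha)-quantile of held-out calibration risk scores."""
    n = len(cal_scores)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, q, method="higher")

def safe_action_mask(risk_scores, tau):
    """Boolean mask keeping only actions whose predicted
    adverse-event risk is at or below the calibrated threshold."""
    return np.asarray(risk_scores) <= tau
```

Downstream preference optimization would then operate only on actions where the mask is `True`, which is what makes the resulting policy conservative by construction.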
📝 Abstract
Population health management programs for Medicaid populations coordinate longitudinal outreach and services (e.g., benefits navigation, behavioral health, social needs support, and clinical scheduling) and must be safe, fair, and auditable. We present a Hybrid Adaptive Conformal Offline Reinforcement Learning (HACO) framework that separates risk calibration from preference optimization to generate conservative action recommendations at scale. In our setting, each step involves choosing among common coordination actions (e.g., which member to contact, by which modality, and whether to route to a specialized service) while controlling the near-term risk of adverse utilization events (e.g., unplanned emergency department visits or hospitalizations). Using a de-identified operational dataset from Waymark comprising 2.77 million sequential decisions across 168,126 patients, HACO (i) trains a lightweight risk model for adverse events, (ii) derives a conformal threshold to mask unsafe actions at a target risk level, and (iii) learns a preference policy on the resulting safe subset. We evaluate policies with a version-agnostic fitted Q-evaluation (FQE) on stratified subsets and audit subgroup performance across age, sex, and race. HACO achieves strong risk discrimination (AUC ≈ 0.81) with a calibrated threshold (τ ≈ 0.038 at α = 0.10), while maintaining high safe coverage. Subgroup analyses reveal systematic differences in estimated value across demographics, underscoring the importance of fairness auditing. Our results show that conformal risk gating integrates cleanly with offline RL to deliver conservative, auditable decision support for population health management teams.
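The subgroup audit described above amounts to comparing estimated policy values (e.g., FQE estimates) across demographic strata. A minimal sketch of such an audit is below; the column names (`fqe_value`, `age_band`, `sex`, `race`) are illustrative assumptions, not fields from the Waymark dataset.

```python
import pandas as pd

def audit_subgroup_values(df, value_col="fqe_value",
                          group_cols=("age_band", "sex", "race")):
    """Per-subgroup mean, spread, and standard error of estimated
    policy value, as a starting point for disparity checks."""
    summaries = {}
    for col in group_cols:
        stats = df.groupby(col)[value_col].agg(["mean", "std", "count"])
        # standard error supports a quick significance screen;
        # a real audit would follow up with formal tests
        stats["se"] = stats["std"] / stats["count"] ** 0.5
        summaries[col] = stats
    return summaries
```

Large gaps in subgroup means relative to their standard errors would flag the kind of systematic value differences the abstract reports, motivating a formal statistical comparison.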