🤖 AI Summary
This work addresses a fundamental trade-off in offline constrained reinforcement learning (RL): DICE-based methods improve policy performance at the expense of off-policy evaluation (OPE) reliability. We first identify that semi-gradient optimization induces a stationary distribution shift that breaks the consistency of cost estimation. To resolve this, we propose the first OPE-aware semi-gradient DICE framework: it corrects semi-gradient updates so that the true stationary distribution is preserved, yielding unbiased and theoretically consistent cost estimates, and it integrates Lagrangian duality for stable constraint satisfaction. Evaluated on the DSRL benchmark, our method achieves state-of-the-art performance, significantly improving cost estimation accuracy, constraint satisfaction, and OPE reliability. Notably, it is the first approach to jointly achieve high-quality offline constrained optimization and trustworthy policy evaluation in a unified, theoretically grounded framework.
📝 Abstract
Stationary Distribution Correction Estimation (DICE) addresses the mismatch between the stationary distribution induced by a policy and the target distribution required for reliable off-policy evaluation (OPE) and policy optimization. DICE-based offline constrained RL particularly benefits from the flexibility of DICE, as it simultaneously maximizes return while estimating costs in offline settings. However, we observe that recent approaches designed to enhance the offline RL performance of the DICE framework inadvertently undermine its ability to perform OPE, making them unsuitable for constrained RL scenarios. In this paper, we identify the root cause of this limitation: their reliance on semi-gradient optimization, which solves a fundamentally different optimization problem and leads to failures in cost estimation. Building on these insights, we propose a novel method that enables both OPE and constrained RL through semi-gradient DICE. Our method ensures accurate cost estimation and achieves state-of-the-art performance on the offline constrained RL benchmark, DSRL.
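To make the OPE role of DICE concrete, the sketch below shows the standard DICE identity the abstract relies on: once correction ratios w(s, a) = d^π(s, a) / d^D(s, a) are available, the policy's expected cost can be estimated from the offline dataset alone as E_{d^D}[w(s, a) c(s, a)], and compared against a cost limit. This is a minimal illustration, not the paper's method: the dataset, the placeholder ratios (a real DICE estimator would learn them), and the `cost_limit` value are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: states and actions are abstracted away; we
# only need per-transition costs c(s, a) and the correction ratios
# w(s, a) = d^pi(s, a) / d^D(s, a) that a DICE method would estimate.
n = 10_000
costs = rng.uniform(0.0, 1.0, size=n)                   # per-transition cost
ratios = rng.lognormal(mean=0.0, sigma=0.3, size=n)     # placeholder ratios
ratios /= ratios.mean()                                 # self-normalize: E_D[w] = 1

# DICE off-policy cost estimate: E_{d^pi}[c] = E_{d^D}[w(s, a) c(s, a)]
cost_estimate = float(np.mean(ratios * costs))

cost_limit = 0.6            # hypothetical constraint threshold
feasible = cost_estimate <= cost_limit
print(cost_estimate, feasible)
```

If the learned ratios drift away from the true stationary distribution ratios, as the paper argues happens under semi-gradient optimization, this estimate becomes biased even though the weighted average itself is computed exactly, which is why constraint satisfaction can silently fail.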