Towards Optimal Differentially Private Regret Bounds in Linear MDPs

📅 2025-04-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work studies the optimal regret bound for personalized decision-making in linear Markov decision processes (MDPs) under joint differential privacy (JDP). We consider the setting where both the transition dynamics and reward functions are linear in a known feature mapping. To address this, we propose the first variance-aware Bernstein-type confidence interval integrated into the LSVI-UCB++ framework, coupled with a JDP-compliant, variance-sensitive privacy mechanism. Theoretically, our algorithm achieves the state-of-the-art private regret bound $widetilde{O}(dsqrt{H^3 K} + H^{4.5} d^{7/6} K^{1/2}/epsilon)$, significantly improving upon prior results. Empirically, it attains performance close to the non-private baseline even under strong privacy guarantees ($epsilon leq 1$). Our approach establishes a new paradigm for privacy-preserving reinforcement learning that simultaneously achieves theoretical optimality and practical efficacy.

Technology Category

Application Category

📝 Abstract
We study regret minimization under privacy constraints in episodic inhomogeneous linear Markov Decision Processes (MDPs), motivated by the growing use of reinforcement learning (RL) in personalized decision-making systems that rely on sensitive user data. In this setting, both transition probabilities and reward functions are assumed to be linear in a feature mapping $phi(s, a)$, and we aim to ensure privacy through joint differential privacy (JDP), a relaxation of differential privacy suited to online learning. Prior work has established suboptimal regret bounds by privatizing the LSVI-UCB algorithm, which achieves $widetilde{O}(sqrt{d^3 H^4 K})$ regret in the non-private setting. Building on recent advances that improve this to minimax optimal regret $widetilde{O}(HDsqrt{K})$ via LSVI-UCB++ with Bernstein-style bonuses, we design a new differentially private algorithm by privatizing LSVI-UCB++ and adapting techniques for variance-aware analysis from offline RL. Our algorithm achieves a regret bound of $widetilde{O}(d sqrt{H^3 K} + H^{4.5} d^{7/6} K^{1/2} / epsilon)$, improving over previous private methods. Empirical results show that our algorithm retains near-optimal utility compared to non-private baselines, indicating that privacy can be achieved with minimal performance degradation in this setting.
Problem

Research questions and friction points this paper is trying to address.

Minimizing regret in linear MDPs with privacy constraints
Improving differentially private RL algorithm performance
Balancing privacy and utility in personalized decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Privatizes LSVI-UCB++ algorithm
Uses Bernstein-style bonuses
Adapts variance-aware offline RL techniques
🔎 Similar Papers
No similar papers found.