🤖 AI Summary
This work studies the optimal regret bound for personalized decision-making in linear Markov decision processes (MDPs) under joint differential privacy (JDP). We consider the setting where both the transition dynamics and reward functions are linear in a known feature mapping. To address this, we propose the first variance-aware Bernstein-type confidence interval integrated into the LSVI-UCB++ framework, coupled with a JDP-compliant, variance-sensitive privacy mechanism. Theoretically, our algorithm achieves the state-of-the-art private regret bound $widetilde{O}(dsqrt{H^3 K} + H^{4.5} d^{7/6} K^{1/2}/epsilon)$, significantly improving upon prior results. Empirically, it attains performance close to the non-private baseline even under strong privacy guarantees ($epsilon leq 1$). Our approach establishes a new paradigm for privacy-preserving reinforcement learning that simultaneously achieves theoretical optimality and practical efficacy.
📝 Abstract
We study regret minimization under privacy constraints in episodic inhomogeneous linear Markov Decision Processes (MDPs), motivated by the growing use of reinforcement learning (RL) in personalized decision-making systems that rely on sensitive user data. In this setting, both transition probabilities and reward functions are assumed to be linear in a feature mapping $phi(s, a)$, and we aim to ensure privacy through joint differential privacy (JDP), a relaxation of differential privacy suited to online learning. Prior work has established suboptimal regret bounds by privatizing the LSVI-UCB algorithm, which achieves $widetilde{O}(sqrt{d^3 H^4 K})$ regret in the non-private setting. Building on recent advances that improve this to minimax optimal regret $widetilde{O}(HDsqrt{K})$ via LSVI-UCB++ with Bernstein-style bonuses, we design a new differentially private algorithm by privatizing LSVI-UCB++ and adapting techniques for variance-aware analysis from offline RL. Our algorithm achieves a regret bound of $widetilde{O}(d sqrt{H^3 K} + H^{4.5} d^{7/6} K^{1/2} / epsilon)$, improving over previous private methods. Empirical results show that our algorithm retains near-optimal utility compared to non-private baselines, indicating that privacy can be achieved with minimal performance degradation in this setting.