Performative Policy Gradient: Optimality in Performative Reinforcement Learning

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning, deploying a policy can shift the dynamics of the environment it acts in, a phenomenon termed performative feedback, which standard methods ignore and therefore fail to handle. Method: We propose PePG, the first policy gradient algorithm with provable convergence to a performatively optimal policy. The framework combines softmax policy parameterization with entropy regularization, derives performative variants of the performance difference lemma and the policy gradient theorem, and designs a gradient update mechanism that explicitly corrects for the environment's feedback to the deployed policy. Contribution/Results: Whereas prior performative RL methods guarantee only stability, PePG converges to a self-consistent, performatively optimal policy. Experiments on standard performative RL benchmarks show that PePG significantly outperforms classical policy gradient methods and stability-oriented baselines, empirically validating its robustness and optimality.
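
To ground these ingredients, a standard way to write the softmax parameterization and the entropy-regularized performative objective is the following; the notation M(π_θ) for the environment induced by deploying π_θ is shorthand assumed here, and the paper's exact formulation may differ:

\[
\pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})},
\qquad
J_\tau(\theta) = \mathbb{E}_{M(\pi_\theta),\,\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t \Big( r(s_t, a_t) + \tau\,\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big) \Big)\right].
\]

Because the trajectory distribution itself depends on θ through M(π_θ), the gradient of J_τ contains a feedback term beyond the classical policy gradient; this is the term the corrected update mechanism must account for.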

📝 Abstract
Post-deployment machine learning algorithms often influence the environments they act in and thus shift the underlying dynamics, a shift that standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG), the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, both with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends prior work in performative RL, which achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validates that PePG outperforms standard policy gradient algorithms and existing performative RL algorithms aiming for stability.
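
To make the stability-versus-optimality distinction in the abstract precise, the two solution concepts, carried over from performative prediction, can be written as follows (notation assumed here: M(π) is the MDP induced by deploying π, and V_M(π) is the value of π in M):

\[
\pi_{\mathrm{PS}} \in \arg\max_{\pi} V_{M(\pi_{\mathrm{PS}})}(\pi),
\qquad
\pi_{\mathrm{PO}} \in \arg\max_{\pi} V_{M(\pi)}(\pi).
\]

A performatively stable policy is merely a fixed point of retraining in the environment it induces, whereas a performatively optimal policy maximises value as evaluated in its own induced environment; PePG targets the latter.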
Problem

Research questions and friction points this paper is trying to address.

Addresses performative distribution shifts in reinforcement learning
Introduces first policy gradient algorithm for performative optimality
Ensures policies remain optimal under self-induced environment changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Performative Policy Gradient algorithm
Converges to performatively optimal policies
Accounts for environment shifts induced by the deployed policy (see the sketch below)
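
To illustrate what optimizing for performative optimality involves, here is a minimal numerical sketch: a toy MDP whose transition kernel responds to the deployed softmax policy, with an entropy-regularized objective maximized by differentiating through that response. The environment-response model, hyperparameters, and the finite-difference gradient are illustrative assumptions, not the paper's PePG update.

import numpy as np

# Toy performative MDP: 2 states, 2 actions, discount gamma.
nS, nA, gamma, tau, eps = 2, 2, 0.9, 0.05, 0.3
rng = np.random.default_rng(0)
R = np.array([[1.0, 0.0], [0.0, 1.0]])            # reward R[s, a]
P0 = rng.dirichlet(np.ones(nS), size=(nS, nA))    # base kernel P0[s, a, s']

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)       # pi[s, a]

def induced_P(pi):
    # Hypothetical environment response: next-state probabilities tilt
    # toward states where the deployed policy plays action 0.
    d = pi[:, 0] / pi[:, 0].sum()                 # distribution over states
    return (1.0 - eps) * P0 + eps * d[None, None, :]

def performative_value(theta):
    # Entropy-regularized value of pi_theta *in the environment that
    # pi_theta itself induces*, from a uniform start distribution.
    pi = softmax_policy(theta)
    P = induced_P(pi)
    entropy = -(pi * np.log(pi + 1e-12)).sum(axis=1)
    r = (pi * R).sum(axis=1) + tau * entropy      # per-state reward
    Ppi = np.einsum('sa,sat->st', pi, P)          # state-to-state kernel
    v = np.linalg.solve(np.eye(nS) - gamma * Ppi, r)
    return v.mean()

def performative_grad(theta, h=1e-5):
    # Finite-difference gradient of the performative objective. Because
    # induced_P depends on theta, this gradient carries the feedback
    # correction that a naive policy gradient (environment held fixed)
    # would miss.
    g = np.zeros_like(theta)
    for i in range(nS):
        for j in range(nA):
            e = np.zeros_like(theta)
            e[i, j] = h
            g[i, j] = (performative_value(theta + e)
                       - performative_value(theta - e)) / (2 * h)
    return g

theta = np.zeros((nS, nA))
for _ in range(500):
    theta += 0.5 * performative_grad(theta)       # gradient ascent
print("final policy:\n", softmax_policy(theta))

Replacing induced_P with a frozen kernel inside performative_value recovers the naive objective; the contrast between the two updates is exactly the stability-versus-optimality gap described above.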
Authors

Debabrota Basu
Faculty, Inria at University of Lille and CNRS (CRIStAL), ELLIS Scholar
Reinforcement Learning, Multi-armed Bandits, Differential Privacy, Fairness, Optimization

Udvas Das
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL, F-59000 Lille, France

Brahim Driss
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL, F-59000 Lille, France

Uddalak Mukherjee
ACMU, Indian Statistical Institute, Kolkata, West Bengal 700108, India