Statistical analysis of Inverse Entropy-regularized Reinforcement Learning

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional inverse reinforcement learning (IRL) suffers from non-uniqueness of the reward function, which renders the estimation problem ill-posed. To address this, we propose an inverse entropy-regularized reinforcement learning framework that combines entropy regularization with a least-squares reconstruction of the reward from the soft Bellman residual, yielding a unique and consistent reward estimate for the expert policy. We model the problem via entropy-regularized Markov decision processes and estimate the expert policy by penalized maximum likelihood over a policy class, with statistical complexity quantified by covering numbers; this yields a high-probability bound on the excess KL divergence between the estimated and expert policies. Building on that bound, we establish non-asymptotic minimax optimal convergence rates for the least-squares reward, making explicit the trade-offs among entropy regularization (smoothing), model complexity, and sample size, and thereby obtain finite-sample guarantees for reward estimation.
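As a rough sketch of why the entropy-regularized formulation removes the ambiguity, here are the standard soft-MDP identities in our own notation (temperature $\lambda$, discount $\gamma$; the paper's exact least-squares objective is not quoted here):
$$
\pi^\star(a\mid s)=\exp\!\Big(\tfrac{Q^\star(s,a)-V^\star(s)}{\lambda}\Big),\qquad
V^\star(s)=\lambda\log\sum_{a}\exp\!\Big(\tfrac{Q^\star(s,a)}{\lambda}\Big),\qquad
Q^\star(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\star(s')\big].
$$
Combining the three relations expresses the reward through the soft Bellman residual,
$$
r(s,a)=\lambda\log\pi^\star(a\mid s)+V^\star(s)-\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V^\star(s')\big],
$$
so that, once the value normalization is fixed, a single reward is consistent with the expert policy. We read the paper's least-squares reward as the minimizer of the squared soft Bellman residual with the penalized maximum-likelihood policy estimate plugged in; that identification is our interpretation of the abstract, not a quoted definition.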

📝 Abstract
Inverse reinforcement learning aims to infer the reward function that explains expert behavior observed through trajectories of state–action pairs. A long-standing difficulty in classical IRL is the non-uniqueness of the recovered reward: many reward functions can induce the same optimal policy, rendering the inverse problem ill-posed. In this paper, we develop a statistical framework for Inverse Entropy-regularized Reinforcement Learning that resolves this ambiguity by combining entropy regularization with a least-squares reconstruction of the reward from the soft Bellman residual. This combination yields a unique and well-defined so-called least-squares reward consistent with the expert policy. We model the expert demonstrations as a Markov chain with the invariant distribution defined by an unknown expert policy $\pi^\star$ and estimate the policy by a penalized maximum-likelihood procedure over a class of conditional distributions on the action space. We establish high-probability bounds for the excess Kullback–Leibler divergence between the estimated policy and the expert policy, accounting for statistical complexity through covering numbers of the policy class. These results lead to non-asymptotic minimax optimal convergence rates for the least-squares reward function, revealing the interplay between smoothing (entropy regularization), model complexity, and sample size. Our analysis bridges the gap between behavior cloning, inverse reinforcement learning, and modern statistical learning theory.
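
A minimal sketch of the policy-estimation step, assuming a tabular softmax policy class, i.i.d. sampled states, and a ridge penalty (all illustrative choices of ours, not the paper's specification):

```python
# Illustrative penalized maximum-likelihood estimation of an expert policy from
# state-action demonstrations. Policy class, penalty, and optimizer are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 5, 3, 2000

# Ground-truth expert policy: a random softmax table (stand-in for pi_star).
logits_true = rng.normal(size=(n_states, n_actions))
pi_true = np.exp(logits_true) / np.exp(logits_true).sum(axis=1, keepdims=True)

# Demonstrations: states drawn i.i.d. (a stand-in for the chain's invariant law),
# actions drawn from the expert policy.
states = rng.integers(n_states, size=n_samples)
actions = np.array([rng.choice(n_actions, p=pi_true[s]) for s in states])

# Sufficient statistics: empirical state-action counts.
counts_sa = np.zeros((n_states, n_actions))
np.add.at(counts_sa, (states, actions), 1.0)
counts_s = counts_sa.sum(axis=1, keepdims=True)

def penalized_nll(logits, lam=1e-2):
    """Average negative log-likelihood of the demonstrations plus a ridge penalty."""
    log_pi = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(counts_sa * log_pi).sum() / n_samples + lam * np.sum(logits ** 2)

def grad(logits, lam=1e-2):
    """Gradient of the penalized objective in the softmax parameterization."""
    pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return (counts_s * pi - counts_sa) / n_samples + 2.0 * lam * logits

# Plain gradient descent on the penalized maximum-likelihood objective.
theta = np.zeros((n_states, n_actions))
for _ in range(2000):
    theta -= 0.5 * grad(theta)

pi_hat = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)

# Per-state KL divergence between the expert policy and the estimate.
kl = np.sum(pi_true * (np.log(pi_true) - np.log(pi_hat)), axis=1)
print("penalized NLL:", round(float(penalized_nll(theta)), 4))
print("mean KL(pi_star || pi_hat):", round(float(kl.mean()), 5))
```

The excess KL divergence printed at the end is the quantity the abstract's high-probability bounds control; the ridge penalty plays the role of the paper's (unspecified here) penalization.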
Problem

Research questions and friction points this paper is trying to address.

Reward ambiguity: many reward functions induce the same optimal policy, making classical IRL ill-posed
How to estimate the expert policy from demonstrations, via penalized maximum likelihood, with statistical guarantees
How to obtain non-asymptotic convergence rates for estimation of the reward function
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy regularization that resolves the reward ambiguity of classical IRL
Least-squares reconstruction of the reward from the soft Bellman residual (see the sketch after this list)
Penalized maximum-likelihood policy estimation with covering-number complexity bounds
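
To make the least-squares-reconstruction bullet concrete, here is a small round-trip check in a tabular entropy-regularized MDP, with made-up sizes, temperature, and discount; it illustrates the soft Bellman identity rather than reproducing the paper's estimator:

```python
# Round-trip check (our own construction): in a tabular entropy-regularized MDP,
# the reward is recovered from the optimal policy and soft value via
#   r(s, a) = lam * log pi_star(a | s) + V_star(s) - gamma * E_{s' | s, a}[V_star(s')].
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, lam = 4, 3, 0.9, 0.5

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a, s']
r_true = rng.normal(size=(nS, nA))              # ground-truth reward

# Soft value iteration: V(s) = lam * log sum_a exp((r + gamma * P V) / lam).
V = np.zeros(nS)
for _ in range(2000):
    Q = r_true + gamma * P @ V
    V = lam * np.log(np.exp(Q / lam).sum(axis=1))

# Optimal entropy-regularized policy induced by the (converged) soft values.
Q = r_true + gamma * P @ V
pi_star = np.exp((Q - V[:, None]) / lam)

# Reconstruct the reward from (pi_star, V) through the soft Bellman residual.
r_rec = lam * np.log(pi_star) + V[:, None] - gamma * P @ V
print("max reconstruction error:", float(np.abs(r_rec - r_true).max()))
```

With the true policy and soft value the reconstruction is exact; in the statistical setting of the paper, the expert policy is replaced by its penalized maximum-likelihood estimate, which is where the convergence rates enter.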
🔎 Similar Papers
No similar papers found.