🤖 AI Summary
This paper addresses performance degradation in offline reinforcement learning (ORL) for continuous state-action spaces, primarily caused by distributional shift between the learned policy and the offline dataset. We propose a novel framework integrating inverse optimization (IO) with robust, non-causal model predictive control (MPC). Our key contributions are: (1) an IO objective based on suboptimality loss, explicitly mitigating distribution mismatch between the policy and offline data; (2) a robust MPC expert whose dynamics and constraints admit exact convex reformulation, ensuring planning feasibility and closed-loop stability; and (3) the first joint modeling of IO and robust MPC, enabling efficient policy learning under extreme data scarcity. Evaluated on MuJoCo benchmarks, our method achieves state-of-the-art performance using only 0.1% of the parameters required by leading baselines, significantly reducing computational overhead. The implementation is publicly available.
📝 Abstract
Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert that steers a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained with the proposed convex loss function, enjoys ample expressiveness and achieves performance competitive with state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while using three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.
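To give a concrete feel for the sub-optimality loss from the IO literature: given an expert state-action pair, it measures how much worse the expert's action scores under a parameterized cost than the best action that cost would pick. Below is a minimal sketch (not the paper's implementation) assuming a quadratic-in-action hypothesis that is linear in the parameters `theta = (Q, A)` and a finite candidate action grid standing in for the continuous action space; under these assumptions the loss is convex in `theta`, since it is the sum of an affine term and a pointwise maximum of affine terms.

```python
import numpy as np

def suboptimality_loss(theta, state, expert_action, action_grid):
    """Sub-optimality loss from inverse optimization:
        F_theta(s, a_expert) - min_u F_theta(s, u).
    Convex in theta because F_theta is linear in theta = (Q, A).
    `action_grid` is a finite stand-in for the continuous action set."""
    Q, A = theta

    def F(u):
        # Hypothetical quadratic-in-action cost: u'Qu + s'Au
        return u @ Q @ u + state @ A @ u

    best = min(F(u) for u in action_grid)
    return F(expert_action) - best
```

The loss is zero exactly when the expert's action already minimizes the hypothesized cost over the candidate set, so driving it down over a dataset fits a cost function that rationalizes the offline data.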