🤖 AI Summary
This work addresses the low approximation accuracy and training instability of Q-function estimation in reinforcement learning. We propose parameterizing the Q-function directly as a Gaussian mixture model (GMM) and embedding it in the Bellman residual, which yields a differentiable objective for the policy-evaluation step of policy iteration and requires no experience replay. We show that these GMM Q-functions are universal approximators over a broad class of functions. The parameters are learned by Riemannian optimization, which keeps the covariance matrices positive definite by construction and thereby improves training stability. We provide theoretical guarantees on generalization error bounds and convergence. Empirically, our method is competitive with, and in some cases surpasses, state-of-the-art approaches across multiple benchmark tasks, with a significantly smaller computational footprint than deep-learning methods such as DQN that rely on experience data.
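To illustrate why a manifold update preserves positive definiteness, here is a minimal NumPy sketch of one descent step on the manifold of symmetric positive-definite (SPD) matrices. It assumes the affine-invariant metric and its exponential map, which is one common choice and not necessarily the one used in the paper; the names `spd_step`, `euclid_grad`, and `step` are hypothetical.

```python
import numpy as np


def _sym(a):
    """Symmetric part of a square matrix."""
    return 0.5 * (a + a.T)


def _sym_fun(a, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, v = np.linalg.eigh(_sym(a))
    return (v * fun(w)) @ v.T


def spd_step(cov, euclid_grad, step):
    """One Riemannian descent step on the SPD manifold (assumed metric).

    Under the affine-invariant metric the Riemannian gradient at C is
    C sym(G) C, and the exponential map is
    Exp_C(X) = C^{1/2} expm(C^{-1/2} X C^{-1/2}) C^{1/2}.
    Substituting X = -step * C sym(G) C gives the expression below.
    """
    root = _sym_fun(cov, np.sqrt)                      # C^{1/2}
    inner = -step * root @ _sym(euclid_grad) @ root    # symmetric matrix
    return _sym(root @ _sym_fun(inner, np.exp) @ root)


if __name__ == "__main__":
    cov = np.eye(2)
    grad = np.array([[5.0, 0.0], [0.0, -1.0]])
    # A plain Euclidean step leaves the SPD cone here (eigenvalues -4 and 2).
    print(np.linalg.eigvalsh(cov - 1.0 * grad))
    # The manifold step stays positive definite for any step size.
    print(np.linalg.eigvalsh(spd_step(cov, grad, 1.0)))
```

Because the update ends in a matrix exponential of a symmetric matrix, the result has strictly positive eigenvalues in exact arithmetic, so no projection or clipping is needed.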
📝 Abstract
Gaussian mixture models (GMMs) are conventionally used in reinforcement learning (RL) as estimators of probability density functions. This paper instead gives them a novel function-approximation role as direct surrogates for Q-function losses. These parametric models, termed GMM-QFs, possess substantial representational capacity: they are shown to be universal approximators over a broad class of functions. They are embedded within Bellman residuals, and their learnable parameters (a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices) are inferred from data via optimization on a Riemannian manifold. This geometric view of the parameter space naturally brings Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established. Supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and in some cases outperform state-of-the-art approaches across a range of benchmark RL tasks, while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.
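As a rough illustration of the parameterization described above, the following NumPy sketch evaluates a Q-function built from K Gaussian components and plugs it into an empirical squared Bellman residual. The exact functional form is an assumption made for illustration: real-valued mixing weights, unnormalized Gaussians, and a state-action input formed by appending a scalar action to the state vector. The names `gmm_q` and `bellman_residual` are hypothetical and not taken from the paper.

```python
import numpy as np


def gmm_q(z, weights, means, covs):
    """Assumed form: Q(z) = sum_k w_k * exp(-(z - m_k)^T C_k^{-1} (z - m_k)).

    z: (D,) state-action vector; weights: (K,); means: (K, D); covs: (K, D, D).
    """
    diffs = z[None, :] - means                                  # (K, D)
    solved = np.linalg.solve(covs, diffs[..., None])[..., 0]    # C_k^{-1} diffs
    quad = np.einsum("kd,kd->k", diffs, solved)                 # (K,)
    return float(weights @ np.exp(-quad))


def bellman_residual(params, states, actions, rewards,
                     next_states, next_actions, gamma=0.9):
    """Mean squared Bellman residual for a fixed policy.

    next_actions holds the actions the evaluated policy takes in next_states.
    """
    weights, means, covs = params
    total = 0.0
    for s, a, r, s2, a2 in zip(states, actions, rewards,
                               next_states, next_actions):
        q_now = gmm_q(np.append(s, a), weights, means, covs)
        q_next = gmm_q(np.append(s2, a2), weights, means, covs)
        total += (r + gamma * q_next - q_now) ** 2
    return total / len(rewards)


if __name__ == "__main__":
    # One component centred at the origin with identity covariance.
    params = (np.array([2.0]), np.zeros((1, 2)), np.eye(2)[None])
    print(gmm_q(np.array([1.0, 0.0]), *params))
    print(bellman_residual(params,
                           states=np.array([[0.0]]), actions=np.array([0.0]),
                           rewards=np.array([1.0]),
                           next_states=np.array([[0.0]]),
                           next_actions=np.array([0.0]),
                           gamma=0.5))
```

In a policy-evaluation step, a residual of this kind would be minimized over the weights, means, and covariance matrices, with the covariances constrained to remain positive definite.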