🤖 AI Summary
Manual tuning of Markov chain Monte Carlo (MCMC) samplers for complex probabilistic models is labor-intensive and adapts poorly across models. To address this, we propose a reinforcement learning (RL)-based adaptive MCMC framework. We formulate the Metropolis–Hastings algorithm as a Markov decision process and design an adaptive gradient-based proposal kernel that balances learnability and flexibility. Crucially, we introduce a contrastive-divergence-driven reward function that overcomes the insufficient training signal provided by conventional rewards (e.g., the acceptance rate or the expected squared jump distance), and we employ policy gradient methods to optimize the sampling policy end-to-end. Experiments on the posteriordb benchmark demonstrate that our approach significantly improves convergence speed and effective sample size (ESS), outperforming both classical adaptive MCMC methods and manually tuned samplers.
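The core framing above, Metropolis–Hastings viewed as a sequential decision process, can be illustrated with a minimal random-walk sampler. This is a toy sketch, not the paper's method: the standard-normal target, fixed Gaussian proposal, and all constants are illustrative choices; in RLMH the proposal mechanism marked below is what a learned policy would control.

```python
import math
import random

def log_target(x):
    # Toy target: standard normal log-density, up to an additive constant.
    return -0.5 * x * x

def metropolis_hastings(n_steps, step_size, seed=0):
    """Random-walk Metropolis. In the MDP view, the current point x is the
    state, the proposal mechanism (here a fixed Gaussian step, but a learned
    policy in RLMH) is the action, and accept/reject closes the transition."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # <-- policy-controlled in RLMH
        # Accept with probability min(1, pi(proposal) / pi(x)).
        if rng.random() < math.exp(min(0.0, log_target(proposal) - log_target(x))):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps

samples, acc_rate = metropolis_hastings(20000, step_size=2.4, seed=0)
```

The step size 2.4 is the classical well-tuned choice for a one-dimensional Gaussian target; the tuning burden the summary describes is precisely that such constants must otherwise be found by hand for each model.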
📝 Abstract
Sampling algorithms drive probabilistic machine learning, and recent years have seen an explosion in the diversity of tools for this task. However, the increasing sophistication of sampling algorithms is correlated with an increase in the tuning burden. There is now a greater need than ever to treat the tuning of samplers as a learning task in its own right. In a conceptual breakthrough, Wang et al. (2025) formulated Metropolis–Hastings as a Markov decision process, opening up the possibility of adaptive tuning using reinforcement learning (RL). Their emphasis was on theoretical foundations; realising the practical benefit of Reinforcement Learning Metropolis–Hastings (RLMH) was left for subsequent work. The purpose of this paper is twofold. First, we observe the surprising result that natural choices of reward, such as the acceptance rate or the expected squared jump distance, provide insufficient signal for training RLMH. Instead, we propose a novel reward based on the contrastive divergence, whose superior performance in the context of RLMH is demonstrated. Second, we explore the potential of RLMH and present adaptive gradient-based samplers that balance flexibility of the Markov transition kernel with learnability of the associated RL task. A comprehensive simulation study using the posteriordb benchmark supports the practical effectiveness of RLMH.
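As a concrete illustration of the RL machinery in play, a REINFORCE-style policy gradient can tune a random-walk step size. This sketch does not use the paper's contrastive-divergence reward or its gradient-based kernels: the reward here is the expected squared jump distance, which the abstract notes carries insufficient signal for full RLMH but is adequate for this toy one-dimensional problem. The lognormal policy, learning rate, and episode lengths are all illustrative assumptions.

```python
import math
import random

def esjd(step_size, n_steps, rng):
    # Reward for one episode: expected squared jump distance of a short
    # random-walk Metropolis chain targeting a standard normal.
    x, total = 0.0, 0.0
    for _ in range(n_steps):
        prop = x + rng.gauss(0.0, step_size)
        if rng.random() < math.exp(min(0.0, 0.5 * x * x - 0.5 * prop * prop)):
            total += (prop - x) ** 2
            x = prop
    return total / n_steps

rng = random.Random(1)
theta, pol_sd, lr = math.log(0.5), 0.3, 0.05  # policy: step ~ LogNormal(theta, pol_sd^2)
baseline = 0.0                                 # running-average baseline (variance reduction)
for _ in range(300):
    eps = rng.gauss(0.0, pol_sd)
    step = math.exp(theta + eps)               # sample an action from the policy
    reward = esjd(step, 200, rng)
    baseline += 0.1 * (reward - baseline)
    # REINFORCE: grad_theta log-density of the sampled action is eps / pol_sd^2.
    theta += lr * (reward - baseline) * eps / pol_sd ** 2

learned_step = math.exp(theta)
```

Starting from a deliberately small step size, the policy parameter drifts toward the high-ESJD region near the classical optimum. The same loop structure accommodates richer rewards, such as the contrastive-divergence reward the paper advocates, by swapping out the episode's return.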