π€ AI Summary
This work addresses the planning problem in risk-sensitive Markov decision processes (MDPs) with entropy-regularized risk measures (ERM) as the optimization objective. The authors propose a Monte Carlo tree search (MCTS) algorithm that integrates upper confidence bound (UCB) strategies with dynamic programming. Notably, this is the first work to establish non-asymptotic correctness guarantees and polynomial regret bounds for risk-sensitive MCTS, rigorously proving that the empirical ERM at the root node converges to the optimal value. Both theoretical analysis and empirical evaluations demonstrate that the proposed algorithm outperforms existing baselines in risk-sensitive decision-making tasks, achieving a strong balance between theoretical rigor and practical effectiveness.
π Abstract
We propose a provably correct Monte Carlo tree search (MCTS) algorithm for solving risk-aware Markov decision processes (MDPs) with entropic risk measure (ERM) objectives. We provide a non-asymptotic analysis of our proposed algorithm, showing that the algorithm: (i) is correct in the sense that the empirical ERM obtained at the root node converges to the optimal ERM; and (ii) enjoys polynomial regret concentration. Our algorithm successfully exploits the dynamic programming formulations for solving risk-aware MDPs with ERM objectives introduced by previous works in the context of an upper confidence bound-based tree search algorithm. Finally, we provide a set of illustrative experiments comparing our risk-aware MCTS method against relevant baselines.