🤖 AI Summary
This work studies infinite-horizon discounted generalized-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when a policy is evaluated on a single observed trajectory rather than in expectation. The authors first establish foundational results for this setting: they identify which class of policies suffices for optimality, cast the problem as an equivalent standard MDP, and study the computational hardness of single-trial policy optimization. They then propose an online planning approach based on Monte-Carlo tree search to solve GUMDPs in this regime. Experiments across several generalized-utility tasks show that the method outperforms relevant baselines.
📝 Abstract
In this work, we contribute the first approach to solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide fundamental results regarding policy optimization in the single-trial regime: we investigate which class of policies suffices for optimality, cast our problem as a particular MDP that is equivalent to the original problem, and study the computational hardness of policy optimization in this regime. Second, we show how online planning techniques, in particular a Monte-Carlo tree search algorithm, can be leveraged to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
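To see why the single-trial regime differs from the usual expected-utility setting, note that a GUMDP objective applies a (generally nonlinear) utility f to a trajectory's discounted state-occupancy. The sketch below is an illustrative toy, not the paper's algorithm: it uses a hypothetical two-state chain and a concave square-root utility to show that the single-trial objective E[f(occupancy)] generally differs from f(E[occupancy]), the quantity optimized in the expected regime.

```python
import random

GAMMA = 0.9
HORIZON = 50  # truncation of the infinite horizon for simulation

def sample_occupancy(rng):
    """Discounted occupancy of state 1 along one trajectory of a toy
    two-state chain: at each step the agent is in state 1 w.p. 0.5."""
    occ = 0.0
    for t in range(HORIZON):
        state = 1 if rng.random() < 0.5 else 0
        occ += (GAMMA ** t) * state
    return occ * (1 - GAMMA)  # normalize into [0, 1]

def f(x):
    """A concave utility (square root), favoring balanced visitation."""
    return x ** 0.5

rng = random.Random(0)
samples = [sample_occupancy(rng) for _ in range(10_000)]

mean_occ = sum(samples) / len(samples)
expected_regime = f(mean_occ)                              # f(E[occupancy])
single_trial = sum(f(s) for s in samples) / len(samples)   # E[f(occupancy)]

# By Jensen's inequality, E[f(occ)] <= f(E[occ]) for concave f,
# so the two regimes disagree whenever occupancy is random and f is nonlinear.
print(expected_regime, single_trial)
```

This gap is exactly why single-trial GUMDPs cannot simply be reduced to optimizing the expected occupancy, motivating the equivalent-MDP construction and planning approach studied in the paper.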