🤖 AI Summary
This work studies infinite-horizon discounted generalized-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when a policy is evaluated on a single observed trajectory rather than in expectation. The authors first establish foundational results for this setting: they identify which class of policies suffices for optimality, cast the problem as an equivalent standard MDP, and study the computational hardness of single-trial policy optimization. They then propose an online planning approach based on Monte-Carlo tree search to solve GUMDPs in this regime. Experiments across several generalized-utility tasks show that the method outperforms relevant baselines.
📝 Abstract
In this work, we contribute the first approach to solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide fundamental results regarding policy optimization in the single-trial regime: we investigate which class of policies suffices for optimality, cast our problem as a particular MDP that is equivalent to the original problem, and study the computational hardness of policy optimization in this regime. Second, we show how online planning techniques, in particular a Monte-Carlo tree search algorithm, can be leveraged to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
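To see why the single-trial regime differs from the usual expected-utility setting, note that a GUMDP objective applies a (generally nonlinear) utility f to a trajectory's discounted state-occupancy. The sketch below is an illustrative toy, not the paper's algorithm: it uses a hypothetical two-state chain and a concave square-root utility to show that the single-trial objective E[f(occupancy)] generally differs from f(E[occupancy]), the quantity optimized in the expected regime.

```python
import random

GAMMA = 0.9
HORIZON = 50  # truncation of the infinite horizon for simulation

def sample_occupancy(rng):
    """Discounted occupancy of state 1 along one trajectory of a toy
    two-state chain: at each step the agent is in state 1 w.p. 0.5."""
    occ = 0.0
    for t in range(HORIZON):
        state = 1 if rng.random() < 0.5 else 0
        occ += (GAMMA ** t) * state
    return occ * (1 - GAMMA)  # normalize into [0, 1]

def f(x):
    """A concave utility (square root), favoring balanced visitation."""
    return x ** 0.5

rng = random.Random(0)
samples = [sample_occupancy(rng) for _ in range(10_000)]

mean_occ = sum(samples) / len(samples)
expected_regime = f(mean_occ)                              # f(E[occupancy])
single_trial = sum(f(s) for s in samples) / len(samples)   # E[f(occupancy)]

# By Jensen's inequality, E[f(occ)] <= f(E[occ]) for concave f,
# so the two regimes disagree whenever occupancy is random and f is nonlinear.
print(expected_regime, single_trial)
```

This gap is exactly why single-trial GUMDPs cannot simply be reduced to optimizing the expected occupancy, motivating the equivalent-MDP construction and planning approach studied in the paper.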