🤖 AI Summary
This work investigates how the number of evaluation trials, i.e., the number of sampled trajectories, affects policy evaluation in infinite-horizon general-utility Markov decision processes (GUMDPs), revealing a fundamental distinction from standard MDPs: in GUMDPs, the expected performance of a policy depends, in general, on the number of trials. We present the first systematic analysis of this phenomenon, bounding the mismatch between the finite-trials and infinite-trials formulations of policy evaluation under both discounted and average GUMDPs. For discounted GUMDPs, we prove lower and upper bounds on this mismatch; for average GUMDPs, we characterize how structural properties of the underlying GUMDP, such as periodicity and connectivity, govern its magnitude. Numerical experiments confirm that the number of trajectories and the structure of the GUMDP jointly determine evaluation accuracy. Together, these results give the first theoretical framework, and practical guidance, for policy evaluation in GUMDPs.
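To make the distinction concrete, here is a brief sketch in our own notation (following the standard convex/general-utility MDP setup; the paper's exact definitions may differ). In the discounted case, the infinite-trials objective applies the utility $f$ to the exact discounted occupancy measure, whereas the finite-trials objective averages the empirical occupancies of $N$ sampled trajectories before applying $f$:

$$
d^{\pi}_{\gamma}(s,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr\left[s_t=s,\,a_t=a\mid\pi\right],
\qquad
J_{\infty}(\pi) \;=\; f\!\left(d^{\pi}_{\gamma}\right),
\qquad
J_{N}(\pi) \;=\; \mathbb{E}\left[f\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat{d}^{(i)}\right)\right],
$$

where $\hat{d}^{(i)}$ is the empirical discounted visitation frequency of the $i$-th trajectory. If $f$ is linear, as in standard MDPs, expectation and $f$ commute and $J_N(\pi) = J_\infty(\pi)$ for every $N$; if $f$ is nonlinear, Jensen's inequality already shows the two can differ, and this finite-vs-infinite-trials mismatch is exactly what the paper bounds.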
📝 Abstract
The general-utility Markov decision processes (GUMDPs) framework generalizes the MDP framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute the first analysis of the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key role in infinite-horizon GUMDPs: the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on the discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite-trials and infinite-trials formulations. Second, we address average GUMDPs, studying how different classes of GUMDPs affect this mismatch. Third, we provide a set of empirical results supporting our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.
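As a companion to the sketch above, the following minimal Python simulation (our own illustration, not code from the paper; the chain, utility, and constants are arbitrary choices) estimates $J_N$ for a 2-state Markov chain under a fixed policy with the concave utility $f(d) = -\sum_s d(s)\log d(s)$, showing that $J_N$ is biased for small $N$ and approaches $f(d^{\pi}_{\gamma})$ as $N$ grows:

```python
# A minimal sketch (not the paper's code) of why the number of trials N
# matters when the utility f is nonlinear. We use a 2-state Markov chain
# under a fixed policy, truncated-horizon estimates of the discounted
# state occupancy d, and the entropy utility f(d) = -sum_s d(s) log d(s).
# All names and constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],   # transition matrix induced by the policy
              [0.2, 0.8]])
gamma, T = 0.9, 150          # discount factor, truncation horizon

def true_occupancy():
    """Exact discounted occupancy: d = (1 - gamma) * mu0 (I - gamma P)^-1."""
    mu0 = np.array([1.0, 0.0])  # deterministic start in state 0
    return (1 - gamma) * mu0 @ np.linalg.inv(np.eye(2) - gamma * P)

def sampled_occupancy():
    """Empirical discounted occupancy of one sampled trajectory."""
    d, s = np.zeros(2), 0
    for t in range(T):
        d[s] += (1 - gamma) * gamma**t
        s = rng.choice(2, p=P[s])
    return d

def f(d):
    """Nonlinear (concave) utility: entropy of the occupancy."""
    d = np.clip(d, 1e-12, None)
    return -np.sum(d * np.log(d))

d_true = true_occupancy()
for N in (1, 10, 100):
    # Monte Carlo estimate of J_N = E[ f( mean of N per-trajectory occupancies ) ]
    vals = [f(np.mean([sampled_occupancy() for _ in range(N)], axis=0))
            for _ in range(200)]
    print(f"N={N:>3}: J_N ~ {np.mean(vals):.4f}   vs   f(d) = {f(d_true):.4f}")
```

Because $f$ here is concave, Jensen's inequality gives $J_N(\pi) \le f(d^{\pi}_{\gamma})$ for every $N$, and the gap shrinks as $N$ grows and the averaged empirical occupancy concentrates around its mean; with a linear $f$ the gap would vanish identically, recovering the standard-MDP behavior.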