AI Summary
Linear MDPs require both reward functions and state transitions to be linear, limiting their ability to model nonlinear sparse rewards (e.g., binary or count-based rewards) common in real-world applications.
Method: This paper proposes the Generalized Linear MDP (GLMDP) framework, retaining linear state transitions while introducing Generalized Linear Models (GLMs) for reward modeling, the first such integration. We define a novel Bellman-complete function class and design GPEVI, an offline RL algorithm with pessimism, along with its semi-supervised variant SS-GPEVI.
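To make the reward-modeling step concrete, here is a minimal sketch of fitting a binary (Bernoulli) reward with a GLM over state-action features, in contrast to the least-squares fit a linear MDP would use. The logistic link, the feature matrix `Phi`, and all constants are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: GLM reward model for binary rewards over linear features.
# Assumes a logistic link P(r = 1 | s, a) = sigmoid(phi(s, a)^T theta);
# a plain linear MDP would fit these rewards by least squares and
# ignore the {0, 1} structure.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 500
theta_true = rng.normal(size=d)          # unknown reward parameter

# Logged features phi(s, a) for n offline transitions, with binary
# rewards drawn through the logistic link.
Phi = rng.normal(size=(n, d))
p = 1.0 / (1.0 + np.exp(-Phi @ theta_true))
r = rng.binomial(1, p)

# Maximum-likelihood GLM fit by gradient ascent on the mean
# log-likelihood, with a small ridge penalty for stability.
theta = np.zeros(d)
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-Phi @ theta))
    theta += 0.5 * (Phi.T @ (r - pred) / n - 1e-3 * theta)
```

A count-valued reward would swap the logistic link for a Poisson (exponential) link; the fitting loop is otherwise unchanged.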
Contribution/Results: We establish a theoretical upper bound on policy suboptimality that depends on a generalized coverage metric, substantially improving sample efficiency under label scarcity. Empirically, GLMDP consistently outperforms standard linear MDP methods on binary and count-based reward tasks. The framework provides a theoretically grounded and practically effective paradigm for sparse-feedback domains such as healthcare and e-commerce.
Abstract
The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption that both transition dynamics and reward functions are linear in the same feature space limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structure. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework, an extension of the linear MDP that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards, and we develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI) that utilizes both labeled and unlabeled trajectories. Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited.
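The pessimism mechanism behind GPEVI can be sketched with standard pessimistic value iteration over linear features: regress Bellman targets by ridge regression, then subtract an uncertainty bonus proportional to sqrt(phi^T Lambda^{-1} phi) so that poorly covered state-action pairs are valued conservatively. This is a generic PEVI-style sketch on a toy one-hot feature map; GPEVI additionally models rewards with a GLM, and the environment, feature map, and constants `beta`, `lam` below are illustrative assumptions.

```python
# Sketch: pessimistic value iteration with linear function approximation.
# Backward induction: ridge-regress Bellman targets, then penalize each
# (s, a) by an elliptical uncertainty bonus before taking the max.
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, H = 5, 2, 3
d = n_s * n_a                            # one-hot state-action features

def phi(s, a):
    v = np.zeros(d)
    v[s * n_a + a] = 1.0
    return v

# Offline dataset of (s, a, r, s') tuples from a uniform behavior policy;
# only the last state yields reward 1.
data = [(rng.integers(n_s), rng.integers(n_a)) for _ in range(200)]
data = [(s, a, float(s == n_s - 1), rng.integers(n_s)) for s, a in data]

beta, lam = 0.5, 1.0
V = np.zeros(n_s)                        # value at the final horizon
for h in range(H):                       # backward induction over steps
    F = np.array([phi(s, a) for s, a, _, _ in data])
    y = np.array([r + V[s2] for _, _, r, s2 in data])
    Lam = F.T @ F + lam * np.eye(d)      # regularized Gram matrix
    w = np.linalg.solve(Lam, F.T @ y)    # ridge regression on targets
    Q = np.zeros((n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ np.linalg.solve(Lam, f))
            Q[s, a] = np.clip(f @ w - bonus, 0.0, H)   # pessimism
    V = Q.max(axis=1)
```

The bonus shrinks where the dataset covers a state-action pair well and dominates where it does not, which is what ties the suboptimality bound to a coverage metric.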