Generalized Linear Markov Decision Process

📅 2025-06-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Linear MDPs require both reward functions and state transitions to be linear, limiting their ability to model the nonlinear sparse rewards (e.g., binary or count-based rewards) common in real-world applications. Method: This paper proposes the Generalized Linear MDP (GLMDP) framework, which retains linear state transitions while introducing generalized linear models (GLMs) for reward modeling, the first such integration. The authors define a novel Bellman-complete function class and design GPEVI, a pessimistic offline RL algorithm, along with its semi-supervised variant SS-GPEVI. Contribution/Results: They establish a theoretical upper bound on policy suboptimality that depends on a generalized coverage metric, with substantially improved sample efficiency under label scarcity. Empirically, GLMDP consistently outperforms standard linear MDP methods on binary and count-based reward tasks. The framework provides a theoretically grounded and practically effective paradigm for sparse-feedback domains such as healthcare and e-commerce.

πŸ“ Abstract
The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption, namely that both transition dynamics and reward functions are linear in the same feature space, limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structures. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework, an extension of the linear MDP framework that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards, and we develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI) that uses both labeled and unlabeled trajectories. Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited.
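The two ingredients the abstract combines, a GLM reward model and pessimism, can be sketched as follows. This is a minimal illustration assuming binary rewards with a logistic link and an elliptical confidence bonus; the function names and hyperparameters are hypothetical and this is not the paper's actual GPEVI implementation.

```python
import numpy as np

def fit_glm_reward(Phi, r, lr=0.1, iters=500):
    """MLE for a binary GLM reward model r ~ sigmoid(phi @ w),
    fit by gradient ascent on the logistic log-likelihood.
    Phi: (n, d) feature matrix; r: (n,) observed 0/1 rewards."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Phi @ w))   # predicted reward probabilities
        w += lr * Phi.T @ (r - p) / len(r)   # average log-likelihood gradient
    return w

def pessimistic_values(Phi, r_hat, beta=1.0, lam=1.0):
    """Apply pessimism: subtract an elliptical confidence bonus
    beta * sqrt(phi' Sigma^{-1} phi), where Sigma is the regularized
    empirical covariance of the offline features."""
    Sigma = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    Sigma_inv = np.linalg.inv(Sigma)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', Phi, Sigma_inv, Phi))
    return r_hat - beta * bonus              # pessimistic reward estimates
```

States poorly covered by the offline data receive a large bonus and thus a heavily discounted value estimate, which is the mechanism behind the generalized coverage metric in the suboptimality bound.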
Problem

Research questions and friction points this paper is trying to address.

Extends linear MDP to handle nonlinear or discrete rewards
Addresses limited data in domains like healthcare and e-commerce
Develops offline RL algorithms for improved sample efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends linear MDP with generalized linear reward models
Introduces Bellman completeness for nonlinear rewards
Develops offline RL algorithms with theoretical guarantees