AI Summary
Linear MDPs require both reward functions and state transitions to be linear, limiting their ability to model nonlinear sparse rewards (e.g., binary or count-based rewards) common in real-world applications.
Method: This paper proposes the Generalized Linear MDP (GLMDP) framework, retaining linear state transitions while introducing Generalized Linear Models (GLMs) for reward modeling, the first such integration. We define a novel Bellman-complete function class and design GPEVI, an offline RL algorithm with pessimism, along with its semi-supervised variant SS-GPEVI.
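To make the reward-modeling step concrete, here is a minimal sketch of fitting a binary (Bernoulli) reward with a GLM over state-action features, in contrast to the least-squares fit a linear MDP would use. The logistic link, the feature matrix `Phi`, and all constants are illustrative assumptions, not the paper's exact construction.

```python
# Sketch: GLM reward model for binary rewards over linear features.
# Assumes a logistic link P(r = 1 | s, a) = sigmoid(phi(s, a)^T theta);
# a plain linear MDP would fit these rewards by least squares and
# ignore the {0, 1} structure.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 500
theta_true = rng.normal(size=d)          # unknown reward parameter

# Logged features phi(s, a) for n offline transitions, with binary
# rewards drawn through the logistic link.
Phi = rng.normal(size=(n, d))
p = 1.0 / (1.0 + np.exp(-Phi @ theta_true))
r = rng.binomial(1, p)

# Maximum-likelihood GLM fit by gradient ascent on the mean
# log-likelihood, with a small ridge penalty for stability.
theta = np.zeros(d)
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-Phi @ theta))
    theta += 0.5 * (Phi.T @ (r - pred) / n - 1e-3 * theta)
```

A count-valued reward would swap the logistic link for a Poisson (exponential) link; the fitting loop is otherwise unchanged.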
Contribution/Results: We establish a theoretical upper bound on policy suboptimality that depends on a generalized coverage metric, substantially improving sample efficiency under label scarcity. Empirically, GLMDP consistently outperforms standard linear MDP methods on binary and count-based reward tasks. The framework provides a theoretically grounded and practically effective paradigm for sparse-feedback domains such as healthcare and e-commerce.
Abstract
The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption that both transition dynamics and reward functions are linear in the same feature space limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structure. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework, an extension of the linear MDP that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards, and we develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI) that utilizes both labeled and unlabeled trajectories. Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited.
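The pessimism mechanism behind GPEVI can be sketched with standard pessimistic value iteration over linear features: regress Bellman targets by ridge regression, then subtract an uncertainty bonus proportional to sqrt(phi^T Lambda^{-1} phi) so that poorly covered state-action pairs are valued conservatively. This is a generic PEVI-style sketch on a toy one-hot feature map; GPEVI additionally models rewards with a GLM, and the environment, feature map, and constants `beta`, `lam` below are illustrative assumptions.

```python
# Sketch: pessimistic value iteration with linear function approximation.
# Backward induction: ridge-regress Bellman targets, then penalize each
# (s, a) by an elliptical uncertainty bonus before taking the max.
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, H = 5, 2, 3
d = n_s * n_a                            # one-hot state-action features

def phi(s, a):
    v = np.zeros(d)
    v[s * n_a + a] = 1.0
    return v

# Offline dataset of (s, a, r, s') tuples from a uniform behavior policy;
# only the last state yields reward 1.
data = [(rng.integers(n_s), rng.integers(n_a)) for _ in range(200)]
data = [(s, a, float(s == n_s - 1), rng.integers(n_s)) for s, a in data]

beta, lam = 0.5, 1.0
V = np.zeros(n_s)                        # value at the final horizon
for h in range(H):                       # backward induction over steps
    F = np.array([phi(s, a) for s, a, _, _ in data])
    y = np.array([r + V[s2] for _, _, r, s2 in data])
    Lam = F.T @ F + lam * np.eye(d)      # regularized Gram matrix
    w = np.linalg.solve(Lam, F.T @ y)    # ridge regression on targets
    Q = np.zeros((n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ np.linalg.solve(Lam, f))
            Q[s, a] = np.clip(f @ w - bonus, 0.0, H)   # pessimism
    V = Q.max(axis=1)
```

The bonus shrinks where the dataset covers a state-action pair well and dominates where it does not, which is what ties the suboptimality bound to a coverage metric.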