Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses reinforcement learning in Markov decision processes whose feasible action sets depend on the state and are defined implicitly by operational constraints. The authors propose Bellman-Taylor score decoding, a framework motivated by a Taylor expansion of the optimal action-value function: the policy is learned in a Euclidean score space, and an action decoder maps each score vector to a feasible action. The induced latent-score MDP can be optimized by standard deep reinforcement learning algorithms without differentiating through the decoder. A performance guarantee shows that the optimality gap decomposes into a structural approximation error and an algorithmic learning error. Applied to a queueing network control problem, the method learns a state-dependent index-based dispatching rule, with near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.
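
One way to read the Taylor-expansion motivation is sketched below. This is an illustrative assumption, not the paper's stated derivation: it supposes that actions are embedded in a vector space, that the optimal action-value function is differentiable in the action, and it introduces a hypothetical reference action a_0, feasible set A(s), and score vector z(s). Under a first-order expansion, maximizing the action value over the feasible set reduces approximately to maximizing a linear score, which is what a decoder could then solve.

```latex
% Assumed notation (not from the source): a_0 is a reference action,
% \mathcal{A}(s) is the feasible action set in state s, z(s) is the score vector.
Q^*(s, a) \approx Q^*(s, a_0) + \nabla_a Q^*(s, a_0)^\top (a - a_0)
\quad\Longrightarrow\quad
\arg\max_{a \in \mathcal{A}(s)} Q^*(s, a)
\approx \arg\max_{a \in \mathcal{A}(s)} z(s)^\top a,
\qquad z(s) := \nabla_a Q^*(s, a_0).
```

The term Q^*(s, a_0) - z(s)^\top a_0 does not depend on a, so it drops out of the maximization.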

📝 Abstract

Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman-Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP can then be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.
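
To make the score-space idea concrete, here is a minimal sketch of what an action decoder might look like for a toy dispatching problem. Everything in it is an assumption made for illustration: the feasible set (serve at most `num_servers` jobs in total and no more jobs than a queue holds), the function names `decode_action` and `is_feasible`, and the greedy rule are hypothetical and are not taken from the paper. The decoder picks the feasible action that maximizes the score-weighted number of jobs served, which for this particular constraint set amounts to serving queues in decreasing order of score.

```python
import numpy as np


def decode_action(scores, queue_lengths, num_servers):
    """Hypothetical decoder: map a score vector to a feasible dispatch.

    Assumed feasible set in the current state:
        0 <= action[i] <= queue_lengths[i]  and  sum(action) <= num_servers.
    The greedy rule below maximizes sum(scores * action) over that set.
    """
    scores = np.asarray(scores, dtype=float)
    queue_lengths = np.asarray(queue_lengths, dtype=int)
    action = np.zeros_like(queue_lengths)
    remaining = num_servers
    # Visit queues from highest to lowest score (stable order for ties).
    for i in np.argsort(-scores, kind="stable"):
        # Stop when capacity is used up or serving would not raise the objective.
        if remaining == 0 or scores[i] <= 0:
            break
        served = min(int(queue_lengths[i]), remaining)
        action[i] = served
        remaining -= served
    return action


def is_feasible(action, queue_lengths, num_servers):
    """Check the assumed state-dependent constraints."""
    action = np.asarray(action)
    queue_lengths = np.asarray(queue_lengths)
    within_queues = np.all((action >= 0) & (action <= queue_lengths))
    return bool(within_queues and action.sum() <= num_servers)


# Example: scores would come from a policy network; here they are made up.
scores = [0.2, 1.5, -0.3, 0.9]
queues = [3, 1, 4, 0]
action = decode_action(scores, queues, num_servers=3)
print(action)  # [2 1 0 0]
```

Because the decoder is an ordinary function of the score vector, a standard DRL algorithm can treat the scores as its continuous action and the decoder as part of the environment, so no gradient has to pass through the sort or the `min`. In this toy setting the decoded policy is an index rule: queues are prioritized by their current scores.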

Problem

Research questions and friction points this paper is trying to address.

Markov Decision Processes

State-Dependent Actions

Feasible Action Sets

Deep Reinforcement Learning

Operational Constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bellman-Taylor Score Decoding

State-Dependent Action Sets

Action Decoder