AI Summary
This paper addresses the NP-hard job-shop scheduling problem (JSSP) by pioneering the application of offline reinforcement learning (RL) to overcome the sample inefficiency and cold-start limitations inherent in online RL. The proposed method introduces three key innovations: (1) a modified Conservative Q-Learning (CQL) algorithm tailored for maskable action spaces; (2) an entropy-based reward mechanism within a discrete Soft Actor-Critic (SAC) framework to enhance policy exploration; and (3) expert dataset augmentation via controlled noise injection and reward normalization for robust offline training. An offline Q-learning framework is constructed by integrating masked actions and entropy regularization into both quantile-based discrete Q-networks (mQRDQN) and discrete maximum-entropy SAC (mSAC). Experiments demonstrate that the approach significantly outperforms online RL on both generated and standard benchmark instances. Moreover, noisy expert data yields performance comparable to, or even exceeding, that of pristine expert data, empirically validating the utility of counterfactual information in offline policy learning.
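To make the masked CQL penalty and the entropy bonus concrete, the fragment below is a minimal PyTorch sketch of how both terms can be restricted to feasible actions; it is an illustration under our own assumptions, not the authors' implementation. The function names, the `action_mask` convention (True marks a schedulable operation), and the fixed `temperature` coefficient are all hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_cql_loss(q_values, actions, action_mask, cql_alpha=1.0):
    """Conservative Q-Learning penalty restricted to feasible (unmasked) actions.

    q_values:    (batch, num_actions) Q-estimates from the critic
    actions:     (batch,) actions taken in the offline dataset
    action_mask: (batch, num_actions) boolean, True = schedulable operation
    """
    # Push down the log-sum-exp over *feasible* actions only, so infeasible
    # operations never contribute to the conservative penalty.
    masked_q = q_values.masked_fill(~action_mask, float("-inf"))
    logsumexp_q = torch.logsumexp(masked_q, dim=1)
    dataset_q = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    return cql_alpha * (logsumexp_q - dataset_q).mean()

def masked_entropy_bonus(logits, action_mask, temperature=0.2):
    """Entropy bonus of a discrete SAC policy, computed over feasible actions only."""
    masked_logits = logits.masked_fill(~action_mask, float("-inf"))
    log_probs = F.log_softmax(masked_logits, dim=1)
    probs = log_probs.exp()
    # Zero out masked entries before the product to avoid 0 * (-inf) = NaN.
    entropy = -(probs * log_probs.masked_fill(~action_mask, 0.0)).sum(dim=1)
    return temperature * entropy.mean()
```

Restricting both the log-sum-exp and the policy entropy to feasible actions is what distinguishes these maskable variants from standard discrete CQL and SAC, which would otherwise penalize or reward operations that can never be dispatched.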
Abstract
The Job Shop Scheduling Problem (JSSP) is a complex combinatorial optimization problem. While online Reinforcement Learning (RL) has shown promise by quickly finding acceptable solutions for JSSP, it faces key limitations: it requires extensive training interactions from scratch, leading to sample inefficiency; it cannot leverage existing high-quality solutions; and it often yields suboptimal results compared to traditional methods like Constraint Programming (CP). We introduce Offline Reinforcement Learning for Learning to Dispatch (Offline-LD), which addresses these limitations by learning from previously generated solutions. Our approach is motivated by scenarios where historical scheduling data and expert solutions are available, although our current evaluation focuses on benchmark problems. Offline-LD adapts two CQL-based Q-learning methods (mQRDQN and discrete mSAC) for maskable action spaces, introduces a novel entropy bonus modification for discrete SAC, and exploits reward normalization through preprocessing. Our experiments demonstrate that Offline-LD outperforms online RL on both generated and benchmark instances. Notably, by introducing noise into the expert dataset, we achieve similar or better results than those obtained from the expert dataset, suggesting that a more diverse training set is preferable because it contains counterfactual information.
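The dataset-construction idea can be pictured with a short sketch: perturb an expert dispatching trajectory with occasional random feasible actions, then normalize rewards before offline training. The snippet below is a hedged illustration only; `feasible_actions_fn`, the 10% noise probability, and the min-max scaling are hypothetical stand-ins rather than details confirmed by the paper.

```python
import random

def perturb_expert_schedule(expert_actions, feasible_actions_fn, noise_prob=0.1, seed=0):
    """Replay an expert dispatching trajectory and, with a small probability,
    substitute a random feasible operation to obtain a more diverse offline dataset.

    expert_actions:      list of operations chosen by the expert (e.g. a CP solution)
    feasible_actions_fn: callable(step) -> operations schedulable at that step
    """
    rng = random.Random(seed)
    noisy_actions = []
    for step, expert_action in enumerate(expert_actions):
        alternatives = [a for a in feasible_actions_fn(step) if a != expert_action]
        if alternatives and rng.random() < noise_prob:
            noisy_actions.append(rng.choice(alternatives))  # counterfactual choice
        else:
            noisy_actions.append(expert_action)
    return noisy_actions

def normalize_rewards(transitions):
    """Min-max scale rewards across the dataset before offline training."""
    rewards = [t["reward"] for t in transitions]
    lo, hi = min(rewards), max(rewards)
    span = (hi - lo) or 1.0
    return [{**t, "reward": (t["reward"] - lo) / span} for t in transitions]
```

In practice the noisy trajectory would be rolled out in the scheduling environment so that later feasible sets reflect the perturbed decisions; the sketch elides that interaction for brevity.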