🤖 AI Summary
Existing Prompting Decision Transformers (PDTs) employ uniform random sampling of task-conditioning trajectories for prompting in offline multi-task reinforcement learning, ignoring inter-trajectory differences in informativeness and thereby degrading generalization. This paper proposes a lightweight bandit-based prompt tuning method, the first to integrate online multi-armed bandit mechanisms into offline RL prompt optimization. Without fine-tuning the backbone model, it dynamically selects high-informativeness trajectories to construct task-specific prompts. The approach establishes an efficient bridge between generic pre-trained models and task-specific adaptation. Experiments on standard benchmarks and a newly constructed multi-task offline RL environment demonstrate substantial improvements in downstream task performance. Results empirically validate that prompt quality is a critical determinant of Decision Transformer generalization, highlighting the efficacy of adaptive, information-aware prompting over static or random strategies.
📝 Abstract
Harnessing large offline datasets is vital for training foundation models that can generalize across diverse tasks. Offline Reinforcement Learning (RL) offers a powerful framework for these scenarios, enabling the derivation of optimal policies even from suboptimal data. The Prompting Decision Transformer (PDT) is an offline RL multi-task model that distinguishes tasks through stochastic trajectory prompts, which are task-specific tokens maintained in context during rollouts. However, PDT samples these tokens uniformly at random from per-task demonstration datasets, failing to account for differences in token informativeness and potentially leading to performance degradation. To address this limitation, we introduce a scalable bandit-based prompt-tuning method that dynamically learns to construct high-performance trajectory prompts. Our approach significantly enhances downstream task performance without modifying the pre-trained Transformer backbone, and it creates a seamless bridge between general multi-task offline pre-training and task-specific online adaptation. Empirical results on benchmark tasks and a newly designed multi-task environment demonstrate the effectiveness of our method.
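To make the core idea concrete, here is a minimal sketch of bandit-based prompt selection. The paper does not specify its exact bandit algorithm; this example assumes a standard UCB1 strategy, where each candidate trajectory prompt is an arm and the reward is the episode return obtained when the frozen PDT is conditioned on that prompt. The names `candidate_prompts`, `evaluate`, and `ucb_prompt_selection` are illustrative, not from the paper.

```python
import math

def ucb_prompt_selection(candidate_prompts, evaluate, rounds, c=2.0):
    """Pick a trajectory prompt via UCB1 (illustrative sketch, not the paper's exact method).

    candidate_prompts: list of prompt objects (the bandit arms).
    evaluate: callable(prompt) -> scalar episode return from rolling out
              the frozen pre-trained model conditioned on that prompt.
    rounds: total number of online evaluation episodes.
    c: exploration coefficient.
    Returns the index of the prompt with the highest empirical mean return.
    """
    n = len(candidate_prompts)
    counts = [0] * n        # times each prompt was tried
    means = [0.0] * n       # empirical mean return per prompt

    for t in range(1, rounds + 1):
        if t <= n:
            arm = t - 1     # try each prompt once before using the UCB rule
        else:
            # UCB1: empirical mean plus an exploration bonus that shrinks
            # as a prompt accumulates evaluations
            arm = max(
                range(n),
                key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]),
            )
        r = evaluate(candidate_prompts[arm])
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update

    return max(range(n), key=lambda i: means[i])
```

Because only the prompt is optimized, the Transformer backbone stays frozen throughout; the bandit's sample cost is a handful of online rollouts, which is what makes this a lightweight bridge between offline pre-training and online adaptation.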