Enhancing Pre-Trained Decision Transformers with Prompt-Tuning Bandits

๐Ÿ“… 2025-02-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing Prompting Decision Transformers (PDTs) employ uniform random sampling of task-conditioning trajectories for prompting in offline multi-task reinforcement learning, ignoring inter-trajectory differences in informativeness and thereby degrading generalization. This paper proposes a lightweight bandit-based prompt tuning methodโ€”the first to integrate online multi-armed bandit mechanisms into offline RL prompt optimization. Without fine-tuning the backbone model, it dynamically selects high-informativeness trajectories to construct task-specific prompts. The approach establishes an efficient bridge between generic pre-trained models and task-specific adaptation. Experiments on standard benchmarks and a newly constructed multi-task offline RL environment demonstrate substantial improvements in downstream task performance. Results empirically validate that prompt quality is a critical determinant of Decision Transformer generalization, highlighting the efficacy of adaptive, information-aware prompting over static or random strategies.
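To make the bandit mechanism concrete, here is a minimal sketch that treats each candidate demonstration segment as an arm of a UCB1 bandit and uses the episodic return of a rollout as that arm's reward. The paper does not specify this exact algorithm; the class name `PromptBandit`, the exploration coefficient `c`, and the choice of UCB1 are illustrative assumptions.

```python
import numpy as np

class PromptBandit:
    """Hypothetical UCB1 bandit over candidate prompt trajectories.

    Each arm is one demonstration trajectory segment; the reward is the
    episodic return obtained when that segment is used as the prompt.
    """

    def __init__(self, num_candidates: int, c: float = 2.0):
        self.counts = np.zeros(num_candidates)   # pulls per arm
        self.values = np.zeros(num_candidates)   # running mean return per arm
        self.c = c                               # exploration coefficient

    def select(self) -> int:
        # Pull every arm once before applying the UCB rule.
        untried = np.where(self.counts == 0)[0]
        if untried.size > 0:
            return int(untried[0])
        total = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(total) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm: int, episodic_return: float) -> None:
        # Incremental mean update; the pre-trained backbone is never modified.
        self.counts[arm] += 1
        self.values[arm] += (episodic_return - self.values[arm]) / self.counts[arm]
```

In use, `select()` would pick the prompt segment for the next online episode and `update()` would feed back the observed return, so the frozen Transformer is only ever queried, never trained.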

๐Ÿ“ Abstract
Harnessing large offline datasets is vital for training foundation models that can generalize across diverse tasks. Offline Reinforcement Learning (RL) offers a powerful framework for these scenarios, enabling the derivation of optimal policies even from suboptimal data. The Prompting Decision Transformer (PDT) is an offline RL multi-task model that distinguishes tasks through stochastic trajectory prompts, which are task-specific tokens maintained in context during rollouts. However, PDT samples these tokens uniformly at random from per-task demonstration datasets, failing to account for differences in token informativeness and potentially leading to performance degradation. To address this limitation, we introduce a scalable bandit-based prompt-tuning method that dynamically learns to construct high-performance trajectory prompts. Our approach significantly enhances downstream task performance without modifying the pre-trained Transformer backbone. Empirical results on benchmark tasks and a newly designed multi-task environment demonstrate the effectiveness of our method, creating a seamless bridge between general multi-task offline pre-training and task-specific online adaptation.
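For contrast with the bandit approach, the following sketch shows the uniform sampling baseline the abstract describes: a trajectory segment is drawn uniformly at random from a task's demonstration set and prepended to the rollout context. The function names (`sample_uniform_prompt`, `build_context`) and the `(state, action, return_to_go)` tuple layout are illustrative assumptions, not the paper's API.

```python
import random

def sample_uniform_prompt(task_demos, segment_len=5):
    """PDT-style baseline: sample a trajectory segment uniformly at random.

    `task_demos` is a list of demonstration trajectories for one task, each a
    list of (state, action, return_to_go) tuples. Names are illustrative.
    """
    traj = random.choice(task_demos)                       # uniform over trajectories
    start = random.randrange(max(1, len(traj) - segment_len + 1))
    return traj[start:start + segment_len]                 # segment used as the prompt

def build_context(prompt_segment, recent_history):
    # Prompt tokens are prepended to the rollout history fed to the frozen DT,
    # so the prompt stays in context for the whole episode.
    return list(prompt_segment) + list(recent_history)
```

Because this sampling ignores how informative each trajectory is, it is exactly the step the bandit sketch above replaces.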
Problem

Research questions and friction points this paper is trying to address.

PDT samples trajectory prompts uniformly at random, ignoring differences in their informativeness
Uninformative prompts can degrade downstream performance and generalization
Adapting a generic pre-trained model to specific tasks without fine-tuning the Transformer backbone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit-based prompt-tuning method, the first to integrate online multi-armed bandits into offline RL prompt optimization
Dynamic construction of high-performance trajectory prompts without modifying the frozen backbone
A scalable bridge between multi-task offline pre-training and task-specific online adaptation
๐Ÿ”Ž Similar Papers
No similar papers found.