🤖 AI Summary
Existing Prompting Decision Transformers (PDTs) employ uniform random sampling of task-conditioning trajectories for prompting in offline multi-task reinforcement learning, ignoring inter-trajectory differences in informativeness and thereby degrading generalization. This paper proposes a lightweight bandit-based prompt tuning method, the first to integrate online multi-armed bandit mechanisms into offline RL prompt optimization. Without fine-tuning the backbone model, it dynamically selects high-informativeness trajectories to construct task-specific prompts. The approach establishes an efficient bridge between generic pre-trained models and task-specific adaptation. Experiments on standard benchmarks and a newly constructed multi-task offline RL environment demonstrate substantial improvements in downstream task performance. Results empirically validate that prompt quality is a critical determinant of Decision Transformer generalization, highlighting the efficacy of adaptive, information-aware prompting over static or random strategies.
📝 Abstract
Harnessing large offline datasets is vital for training foundation models that can generalize across diverse tasks. Offline Reinforcement Learning (RL) offers a powerful framework for these scenarios, enabling the derivation of optimal policies even from suboptimal data. The Prompting Decision Transformer (PDT) is an offline RL multi-task model that distinguishes tasks through stochastic trajectory prompts, which are task-specific tokens maintained in context during rollouts. However, PDT samples these tokens uniformly at random from per-task demonstration datasets, failing to account for differences in token informativeness and potentially leading to performance degradation. To address this limitation, we introduce a scalable bandit-based prompt-tuning method that dynamically learns to construct high-performance trajectory prompts. Our approach significantly enhances downstream task performance without modifying the pre-trained Transformer backbone, and it creates a seamless bridge between general multi-task offline pre-training and task-specific online adaptation. Empirical results on benchmark tasks and a newly designed multi-task environment demonstrate the effectiveness of our method.
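To make the core idea concrete, here is a minimal sketch of bandit-based prompt selection. The paper does not specify its exact bandit algorithm; this example assumes a standard UCB1 strategy, where each candidate trajectory prompt is an arm and the reward is the episode return obtained when the frozen PDT is conditioned on that prompt. The names `candidate_prompts`, `evaluate`, and `ucb_prompt_selection` are illustrative, not from the paper.

```python
import math

def ucb_prompt_selection(candidate_prompts, evaluate, rounds, c=2.0):
    """Pick a trajectory prompt via UCB1 (illustrative sketch, not the paper's exact method).

    candidate_prompts: list of prompt objects (the bandit arms).
    evaluate: callable(prompt) -> scalar episode return from rolling out
              the frozen pre-trained model conditioned on that prompt.
    rounds: total number of online evaluation episodes.
    c: exploration coefficient.
    Returns the index of the prompt with the highest empirical mean return.
    """
    n = len(candidate_prompts)
    counts = [0] * n        # times each prompt was tried
    means = [0.0] * n       # empirical mean return per prompt

    for t in range(1, rounds + 1):
        if t <= n:
            arm = t - 1     # try each prompt once before using the UCB rule
        else:
            # UCB1: empirical mean plus an exploration bonus that shrinks
            # as a prompt accumulates evaluations
            arm = max(
                range(n),
                key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]),
            )
        r = evaluate(candidate_prompts[arm])
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update

    return max(range(n), key=lambda i: means[i])
```

Because only the prompt is optimized, the Transformer backbone stays frozen throughout; the bandit's sample cost is a handful of online rollouts, which is what makes this a lightweight bridge between offline pre-training and online adaptation.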