🤖 AI Summary
In offline reinforcement learning, Prompting Decision Transformers (PDTs) suffer from weak task discrimination due to uniform trajectory prompt sampling during pretraining. Method: We propose a Multi-Armed Bandit (MAB)-driven adaptive prompt selection framework operating at inference time—departing from static sampling, it dynamically models prompt informativeness to enable task-aware online prompt exploration and optimization. Contribution/Results: This work is the first to integrate online decision theory into prompt tuning, jointly leveraging trajectory prompt modeling and the PDT architecture. On multi-task benchmarks, it achieves a +12.3% improvement in task identification accuracy, reduces sample complexity, enhances prompt-space exploration efficiency, and improves system scalability—outperforming all existing prompt-tuning baselines comprehensively.
📝 Abstract
Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline reinforcement learning pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: Not all prompts are equally informative for differentiating between tasks. To address this, we propose an inference time bandit-based prompt-tuning framework that explores and optimizes trajectory prompt selection to enhance task performance. Our experiments indicate not only clear performance gains due to bandit-based prompt-tuning, but also better sample complexity, scalability, and prompt space exploration compared to prompt-tuning baselines.