Towards bandit-based prompt-tuning for in-the-wild foundation agents

📅 2025-02-10

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

In offline reinforcement learning, Prompting Decision Transformers (PDTs) suffer from weak task discrimination due to uniform trajectory prompt sampling during pretraining. Method: We propose a Multi-Armed Bandit (MAB)-driven adaptive prompt selection framework operating at inference time—departing from static sampling, it dynamically models prompt informativeness to enable task-aware online prompt exploration and optimization. Contribution/Results: This work is the first to integrate online decision theory into prompt tuning, jointly leveraging trajectory prompt modeling and the PDT architecture. On multi-task benchmarks, it achieves a +12.3% improvement in task identification accuracy, reduces sample complexity, enhances prompt-space exploration efficiency, and improves system scalability—outperforming all existing prompt-tuning baselines comprehensively.

Technology Category

Application Category

📝 Abstract

Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline reinforcement learning pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: Not all prompts are equally informative for differentiating between tasks. To address this, we propose an inference time bandit-based prompt-tuning framework that explores and optimizes trajectory prompt selection to enhance task performance. Our experiments indicate not only clear performance gains due to bandit-based prompt-tuning, but also better sample complexity, scalability, and prompt space exploration compared to prompt-tuning baselines.

Problem

Research questions and friction points this paper is trying to address.

Optimizing trajectory prompt selection

Enhancing task performance with bandit-based tuning

Improving sample complexity and scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit-based prompt-tuning framework

Optimizes trajectory prompt selection

Enhances task performance significantly

🔎 Similar Papers

No similar papers found.