Best Policy Learning from Trajectory Preference Feedback

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses preference-based reinforcement learning (PbRL) for generative AI, where a policy is jointly optimized using both offline preference data and online multi-round interactions. Existing methods suffer from poor robustness under skewed offline data and lack theoretical guarantees. To bridge this gap, we establish the first Bayesian regret upper bound for the offline-online hybrid PbRL setting. We propose PSPL (Posterior Sampling for Preference Learning), a novel algorithm that maintains separate posteriors over the reward model and the transition dynamics, and integrates posterior sampling with a Top-Two Thompson Sampling mechanism to enable efficient, preference-driven policy exploration. We provide a rigorous theoretical analysis proving that PSPL achieves a sublinear Bayesian simple regret bound. Empirical evaluation across multiple generative interactive tasks demonstrates that PSPL significantly outperforms state-of-the-art baselines, validating its theoretical soundness and practical effectiveness.
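Preference feedback in PbRL is typically modeled with a Bradley-Terry-style likelihood: a rater prefers one trajectory over another with probability given by a sigmoid of the difference in their (unobserved) cumulative rewards. The sketch below illustrates that standard model; the paper's exact noise and rater-competence model may differ, so treat this as a generic PbRL assumption rather than the authors' specification.

```python
import math

def preference_probability(return_a: float, return_b: float) -> float:
    """Bradley-Terry model: probability that a rater prefers trajectory A
    over trajectory B, given their cumulative rewards. Equal returns give
    probability 0.5; larger gaps give more decisive (but still noisy)
    preferences."""
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))

# A trajectory with a higher return is preferred more often, but noisily.
p = preference_probability(3.0, 1.0)  # ≈ 0.88
```

Under this model, observed binary comparisons give likelihood terms that a Bayesian learner can use to update its posterior over the reward function.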

📝 Abstract
We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards. This approach is useful for post-training optimization of generative AI models during multi-turn user interactions, where preference feedback is more robust than handcrafted reward models. In this setting, learning is driven by both an offline preference dataset -- collected from a rater of unknown 'competence' -- and online data collected with pure exploration. Since offline datasets may exhibit out-of-distribution (OOD) biases, principled online data collection is necessary. To address this, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling, that maintains independent posteriors over the true reward model and transition dynamics. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret of $\mathsf{PSPL}$. Since the exact algorithm can be computationally impractical, we also provide an approximate version that outperforms existing baselines.
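The Top-Two Thompson Sampling mechanism that inspires $\mathsf{PSPL}$ can be illustrated on a simplified finite set of candidate policies with Beta posteriors over preference win rates. This is a minimal sketch of the generic top-two idea, not the paper's algorithm: $\mathsf{PSPL}$ instead maintains posteriors over the reward model and transition dynamics, and the function names and the `beta` mixing parameter below are illustrative assumptions.

```python
import random

def top_two_thompson_sample(successes, failures, beta=0.5, rng=random):
    """Top-Two Thompson Sampling over a finite set of candidate policies.

    successes/failures: per-policy counts of preference wins/losses,
    defining Beta(successes + 1, failures + 1) posteriors.
    beta: probability of evaluating the sampled leader rather than a
    re-sampled challenger.
    Returns the index of the policy to evaluate next.
    """
    n = len(successes)
    if n == 1:
        return 0

    def sample_best():
        # Draw one posterior sample per policy; return the argmax.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(n)]
        return max(range(n), key=lambda i: draws[i])

    leader = sample_best()
    if rng.random() < beta:
        return leader  # evaluate the sampled best policy
    # Otherwise, re-sample until a different policy wins: the "challenger".
    challenger = sample_best()
    while challenger == leader:
        challenger = sample_best()
    return challenger
```

Forcing a distinct challenger with probability `1 - beta` is what makes top-two sampling well suited to best-policy *identification*: it keeps allocating comparisons to the runner-up candidates needed to certify the leader, rather than exploiting a single arm as plain Thompson Sampling would.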
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Optimal Strategy
Interactive Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference-based Reinforcement Learning
Data Skew Handling
Theoretical Performance Bounds