🤖 AI Summary
This work addresses preference-based reinforcement learning (PbRL) for generative AI, where a policy is jointly optimized using both offline preference data and online multi-round interactions. Existing methods suffer from poor robustness under skewed offline data and lack theoretical guarantees. To bridge this gap, we establish the first Bayesian regret upper bound for the offline-online hybrid PbRL setting. We propose Posterior Sampling for Preference Learning (PSPL), a novel algorithm that maintains decoupled posteriors over the reward model and the transition dynamics, and integrates posterior sampling with a Top-Two Thompson Sampling mechanism to enable efficient, preference-driven policy exploration. We provide a rigorous theoretical analysis proving that PSPL achieves a sublinear Bayesian regret bound. Empirical evaluation across multiple generative interactive tasks demonstrates that PSPL significantly outperforms state-of-the-art baselines, validating its theoretical soundness and practical effectiveness.
📝 Abstract
We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards. This approach is useful for post-training optimization of generative AI models during multi-turn user interactions, where preference feedback is more robust than handcrafted reward models. In this setting, learning is driven by both an offline preference dataset -- collected from a rater of unknown 'competence' -- and online data collected with pure exploration. Since offline datasets may exhibit out-of-distribution (OOD) biases, principled online data collection is necessary. To address this, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling, that maintains independent posteriors over the true reward model and transition dynamics. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret of $\mathsf{PSPL}$. Since the exact algorithm can be computationally impractical, we also provide an approximate version that outperforms existing baselines.
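To make the Top-Two mechanism concrete, here is a minimal, self-contained sketch of Top-Two Thompson Sampling on Bernoulli arms with Beta posteriors. This is only an illustration of the exploration principle the abstract refers to, not the paper's $\mathsf{PSPL}$ algorithm: $\mathsf{PSPL}$ maintains posteriors over full reward and transition models and learns from pairwise preferences, whereas this toy treats each "policy" as an independent arm with direct binary feedback. All names and parameters (e.g. `beta=0.5`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class TopTwoThompson:
    """Toy Top-Two Thompson Sampling over Bernoulli arms.

    Illustrative only: each 'policy' is a Bernoulli arm with a
    conjugate Beta posterior, standing in for the richer reward
    and dynamics posteriors used by the actual algorithm.
    """

    def __init__(self, n_arms, beta=0.5):
        self.a = np.ones(n_arms)  # Beta posterior: successes + 1
        self.b = np.ones(n_arms)  # Beta posterior: failures + 1
        self.beta = beta          # prob. of playing the sampled leader

    def _sample_best(self):
        # Draw mean rewards from the posterior and return the argmax.
        return int(np.argmax(rng.beta(self.a, self.b)))

    def select(self):
        leader = self._sample_best()
        if rng.random() < self.beta:
            return leader
        # Otherwise resample until a *different* arm wins: the challenger.
        for _ in range(100):
            challenger = self._sample_best()
            if challenger != leader:
                return challenger
        return leader  # fallback when the posterior is nearly degenerate

    def update(self, arm, reward):
        # Conjugate Beta-Bernoulli update on binary feedback.
        self.a[arm] += reward
        self.b[arm] += 1 - reward


# Usage: identify the best of three arms from binary feedback.
true_means = [0.2, 0.5, 0.8]
agent = TopTwoThompson(n_arms=3)
for _ in range(2000):
    arm = agent.select()
    agent.update(arm, int(rng.random() < true_means[arm]))
best = int(np.argmax(agent.a / (agent.a + agent.b)))
```

The challenger-resampling step is what distinguishes Top-Two sampling from vanilla Thompson Sampling: it forces a fixed fraction of rounds onto the strongest competitor of the current leader, which is what makes the approach suitable for best-arm (here, best-policy) identification rather than cumulative-reward maximization.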