AI Summary
Offline RLHF methods often suffer from over-optimization due to reward misspecification, causing LLMs to deviate from true human preferences. To address this, we propose DRO-REBEL, a distributionally robust offline RLHF framework grounded in Distributionally Robust Optimization (DRO). It is the first to unify Wasserstein, KL, and χ² divergence-based ambiguity sets within the REBEL update paradigm. Leveraging Fenchel duality, DRO-REBEL transforms the inherently complex robust optimization into a tractable relative reward regression problem, with theoretical guarantees on minimax-optimal convergence rates. Crucially, it eliminates the need for PPO-style clipping or auxiliary value networks, enabling efficient large-scale LLM training. Extensive evaluation across sentiment alignment, ArmoRM, and HH-Alignment benchmarks demonstrates strong robustness across model scales and dataset sizes. Among variants, χ²-REBEL achieves the best trade-off between performance and computational efficiency.
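To make the "relative reward regression" reduction concrete, here is a minimal toy sketch (not the paper's implementation): for a log-linear policy π_θ(y|x) ∝ exp(θ·φ(x,y)), the log-probability-ratio difference between two responses is linear in θ − θ_t, so a REBEL-style update becomes ordinary least squares over response pairs. All names, dimensions, and the synthetic data below are hypothetical.

```python
import numpy as np

# Hypothetical toy setup: log-linear policy pi_theta(y|x) ∝ exp(theta · phi(x, y)).
rng = np.random.default_rng(0)
d, n_pairs, eta = 8, 256, 1.0
theta_t = rng.normal(size=d)              # current policy parameters
phi_a = rng.normal(size=(n_pairs, d))     # features of response y
phi_b = rng.normal(size=(n_pairs, d))     # features of response y'
true_w = rng.normal(size=d)               # synthetic linear reward parameters
reward_gap = (phi_a - phi_b) @ true_w     # r(x, y) - r(x, y') for each pair

# For a log-linear policy the partition functions cancel across the pair, so
# (1/eta) * [log-ratio(y) - log-ratio(y')] = (1/eta) * (phi_a - phi_b) @ (theta - theta_t).
# Regressing this onto the reward gap is a plain least-squares problem in delta = theta - theta_t:
X = (phi_a - phi_b) / eta
delta, *_ = np.linalg.lstsq(X, reward_gap, rcond=None)
theta_next = theta_t + delta
```

Because the synthetic reward gap is exactly linear in the feature differences, the regression recovers the reward direction exactly; with a learned reward model the same least-squares step applies to estimated reward gaps.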
Abstract
Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from over-optimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $\chi^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks. Under standard linear-reward and log-linear policy classes with a data-coverage condition, we establish $O(n^{-1/4})$ estimation bounds with tighter constants than prior DRO-DPO approaches, and recover the minimax-optimal $O(n^{-1/2})$ rate via a localized Rademacher complexity analysis. The same analysis closes the gap for Wasserstein-DPO and KL-DPO, showing both also attain optimal parametric rates. We derive practical SGD algorithms for all three divergences: gradient regularization (Wasserstein), importance weighting (KL), and a fast 1-D dual solve ($\chi^2$). Experiments on Emotion Alignment, the large-scale ArmoRM multi-objective benchmark, and HH-Alignment demonstrate strong worst-case robustness across unseen preference mixtures, model sizes, and data scales, with $\chi^2$-REBEL showing consistently strong empirical performance. A controlled radius-coverage study validates a no-free-lunch trade-off: radii shrinking faster than empirical divergence concentration rates achieve minimax-optimal parametric rates but forfeit coverage, while coverage-guaranteeing radii incur $O(n^{-1/4})$ rates.
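The "fast 1-D dual solve" for the $\chi^2$ ambiguity set can be illustrated with a generic sketch of the standard inner maximization: the worst-case reweighting of per-sample losses within a $\chi^2$ ball around the empirical distribution has the form $q_i \propto (\ell_i - \eta)_+$, and the scalar dual variable $\eta$ can be found by bisection. This is a minimal sketch of that well-known parametrization under the convention $D_{\chi^2}(q\,\|\,\text{uniform}) = \tfrac{1}{2}\,\text{mean}((n q_i - 1)^2)$, not the paper's code; the function name and radius convention are assumptions.

```python
import numpy as np

def chi2_worst_case(losses, rho, iters=100):
    """Worst-case reweighting of per-sample losses over a chi^2 ball of
    radius rho around the uniform empirical distribution, via bisection on
    the scalar dual variable eta (the "1-D dual solve")."""
    l = np.asarray(losses, dtype=float)
    n = l.size

    def weights(eta):
        w = np.maximum(l - eta, 0.0)       # standard (l_i - eta)_+ form
        s = w.sum()
        return np.full(n, 1.0 / n) if s == 0.0 else w / s

    def divergence(eta):
        w = weights(eta)
        return 0.5 * np.mean((n * w - 1.0) ** 2)

    hi = l.max()
    if divergence(hi - 1e-12) <= rho:      # ball large enough: all mass on the max loss
        w = (l == hi).astype(float)
        return w / w.sum(), float(hi)
    span = max(hi - l.min(), 1.0)
    lo = l.min() - span
    while divergence(lo) >= rho:           # push lo left until the root is bracketed
        lo -= span
        span *= 2.0
    for _ in range(iters):                 # divergence increases in eta: bisect
        mid = 0.5 * (lo + hi)
        if divergence(mid) < rho:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    return w, float(w @ l)
```

The returned weights tilt mass toward high-loss samples until the $\chi^2$ constraint is tight; the robust objective is then the reweighted mean loss, and the whole solve costs a handful of $O(n)$ passes per SGD step.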