AI Summary
Offline RLHF methods often suffer from over-optimization due to reward misspecification, causing LLMs to deviate from true human preferences. To address this, we propose DRO-REBEL, a distributionally robust offline RLHF framework grounded in Distributionally Robust Optimization (DRO). It is the first to unify Wasserstein, KL, and χ² divergence-based ambiguity sets within the REBEL update paradigm. Leveraging Fenchel duality, DRO-REBEL transforms the inherently complex robust optimization into a tractable relative reward regression problem, with theoretical guarantees on minimax-optimal convergence rates. Crucially, it eliminates the need for PPO-style clipping or auxiliary value networks, enabling efficient large-scale LLM training. Extensive evaluation across sentiment alignment, ArmoRM, and HH-Alignment benchmarks demonstrates strong robustness across model scales and dataset sizes. Among variants, χ²-REBEL achieves the best trade-off between performance and computational efficiency.
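To make the "relative reward regression" reduction concrete, here is a minimal toy sketch (not the paper's implementation): for a log-linear policy π_θ(y|x) ∝ exp(θ·φ(x,y)), the log-probability-ratio difference between two responses is linear in θ − θ_t, so a REBEL-style update becomes ordinary least squares over response pairs. All names, dimensions, and the synthetic data below are hypothetical.

```python
import numpy as np

# Hypothetical toy setup: log-linear policy pi_theta(y|x) ∝ exp(theta · phi(x, y)).
rng = np.random.default_rng(0)
d, n_pairs, eta = 8, 256, 1.0
theta_t = rng.normal(size=d)              # current policy parameters
phi_a = rng.normal(size=(n_pairs, d))     # features of response y
phi_b = rng.normal(size=(n_pairs, d))     # features of response y'
true_w = rng.normal(size=d)               # synthetic linear reward parameters
reward_gap = (phi_a - phi_b) @ true_w     # r(x, y) - r(x, y') for each pair

# For a log-linear policy the partition functions cancel across the pair, so
# (1/eta) * [log-ratio(y) - log-ratio(y')] = (1/eta) * (phi_a - phi_b) @ (theta - theta_t).
# Regressing this onto the reward gap is a plain least-squares problem in delta = theta - theta_t:
X = (phi_a - phi_b) / eta
delta, *_ = np.linalg.lstsq(X, reward_gap, rcond=None)
theta_next = theta_t + delta
```

Because the synthetic reward gap is exactly linear in the feature differences, the regression recovers the reward direction exactly; with a learned reward model the same least-squares step applies to estimated reward gaps.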
Abstract
Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from over-optimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $\chi^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks. Under standard linear-reward and log-linear policy classes with a data-coverage condition, we establish $O(n^{-1/4})$ estimation bounds with tighter constants than prior DRO-DPO approaches, and recover the minimax-optimal $O(n^{-1/2})$ rate via a localized Rademacher complexity analysis. The same analysis closes the gap for Wasserstein-DPO and KL-DPO, showing both also attain optimal parametric rates. We derive practical SGD algorithms for all three divergences: gradient regularization (Wasserstein), importance weighting (KL), and a fast 1-D dual solve ($\chi^2$). Experiments on Emotion Alignment, the large-scale ArmoRM multi-objective benchmark, and HH-Alignment demonstrate strong worst-case robustness across unseen preference mixtures, model sizes, and data scales, with $\chi^2$-REBEL showing consistently strong empirical performance. A controlled radius-coverage study validates a no-free-lunch trade-off: radii shrinking faster than empirical divergence concentration rates achieve minimax-optimal parametric rates but forfeit coverage, while coverage-guaranteeing radii incur $O(n^{-1/4})$ rates.
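The "fast 1-D dual solve" for the $\chi^2$ ambiguity set can be illustrated with a generic sketch of the standard inner maximization: the worst-case reweighting of per-sample losses within a $\chi^2$ ball around the empirical distribution has the form $q_i \propto (\ell_i - \eta)_+$, and the scalar dual variable $\eta$ can be found by bisection. This is a minimal sketch of that well-known parametrization under the convention $D_{\chi^2}(q\,\|\,\text{uniform}) = \tfrac{1}{2}\,\text{mean}((n q_i - 1)^2)$, not the paper's code; the function name and radius convention are assumptions.

```python
import numpy as np

def chi2_worst_case(losses, rho, iters=100):
    """Worst-case reweighting of per-sample losses over a chi^2 ball of
    radius rho around the uniform empirical distribution, via bisection on
    the scalar dual variable eta (the "1-D dual solve")."""
    l = np.asarray(losses, dtype=float)
    n = l.size

    def weights(eta):
        w = np.maximum(l - eta, 0.0)       # standard (l_i - eta)_+ form
        s = w.sum()
        return np.full(n, 1.0 / n) if s == 0.0 else w / s

    def divergence(eta):
        w = weights(eta)
        return 0.5 * np.mean((n * w - 1.0) ** 2)

    hi = l.max()
    if divergence(hi - 1e-12) <= rho:      # ball large enough: all mass on the max loss
        w = (l == hi).astype(float)
        return w / w.sum(), float(hi)
    span = max(hi - l.min(), 1.0)
    lo = l.min() - span
    while divergence(lo) >= rho:           # push lo left until the root is bracketed
        lo -= span
        span *= 2.0
    for _ in range(iters):                 # divergence increases in eta: bisect
        mid = 0.5 * (lo + hi)
        if divergence(mid) < rho:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    return w, float(w @ l)
```

The returned weights tilt mass toward high-loss samples until the $\chi^2$ constraint is tight; the robust objective is then the reweighted mean loss, and the whole solve costs a handful of $O(n)$ passes per SGD step.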