🤖 AI Summary
To address the poor stability of Reinforcement Learning from Human Feedback (RLHF) and the performance degradation of Direct Preference Optimization (DPO) under ambiguous preferences, this paper proposes the Distillation-Regularized Dual Optimization (DRDO) framework, which jointly optimizes reward distillation and preference learning. DRDO is the first method to co-model reward fitting and preference learning end-to-end, integrating a reward distillation loss with an enhanced Bradley–Terry preference likelihood—eliminating the need for a separate reward model. Crucially, it introduces a noise-robust preference likelihood formulation, significantly improving resilience to uncertain or noisy preferences and out-of-distribution (OOD) scenarios. Evaluated on UltraFeedback and TL;DR datasets, DRDO consistently outperforms both DPO and entropy-regularized DPO (e-DPO), achieving higher expected reward and superior robustness across diverse preference quality conditions.
📝 Abstract
Traditional RLHF-based LLM alignment methods explicitly maximize expected reward under a separately trained reward model. More recent supervised alignment methods such as Direct Preference Optimization (DPO) bypass this reward-modeling phase to avoid problems including model drift and reward overfitting. Although popular for their simplicity, DPO and similar direct alignment methods, which rely heavily on the Bradley–Terry pairwise preference formulation, can still produce degenerate policies when faced with non-deterministic or noisy preference labels, for example when human annotators score two candidate outputs with low confidence. This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which models rewards and preferences simultaneously to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences through a novel preference likelihood formulation. Results on the UltraFeedback and TL;DR datasets demonstrate that DRDO-trained policies surpass methods such as DPO and e-DPO in expected reward and are, on average, more robust to noisy preference signals as well as out-of-distribution (OOD) settings.
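The shape of such a joint objective can be illustrated with a minimal sketch: a reward-distillation term that pulls the policy's reward-margin estimate toward the oracle's, plus a Bradley–Terry negative log-likelihood on the preference pair. This is an assumed, simplified form for illustration only, not the paper's exact loss; the function names, the squared-error distillation term, and the plain weighted sum of the two terms are all assumptions.

```python
import math

def bradley_terry_nll(logp_w: float, logp_l: float, beta: float = 0.1) -> float:
    """Negative log-likelihood that the preferred ("winner") response wins
    under a Bradley-Terry model on the scaled policy log-prob margin."""
    margin = beta * (logp_w - logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def drdo_style_loss(oracle_r_w: float, oracle_r_l: float,
                    policy_r_w: float, policy_r_l: float,
                    logp_w: float, logp_l: float,
                    alpha: float = 1.0, beta: float = 0.1) -> float:
    """Illustrative DRDO-style objective (assumed form):
    - a distillation term matching the policy's reward margin to the
      oracle's reward margin for the (winner, loser) pair, and
    - a Bradley-Terry preference term on the policy log-probabilities.
    """
    distill = ((oracle_r_w - oracle_r_l) - (policy_r_w - policy_r_l)) ** 2
    pref = bradley_terry_nll(logp_w, logp_l, beta)
    return alpha * distill + pref

# When the policy's reward margin already matches the oracle's, only the
# preference term remains, so training pressure shifts to the likelihood.
loss = drdo_style_loss(2.0, 1.0, 2.0, 1.0, logp_w=-5.0, logp_l=-6.0)
```

Note how the two terms play the roles described above: the distillation term keeps the policy anchored to the oracle's reward signal even when a preference label is noisy, while the preference term drives the policy to rank the preferred response higher.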