When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

📅 2025-11-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Instance-dependent preference reversal—where human annotators inconsistently rank the same response pair across different contexts—is prevalent in preference labeling, severely degrading RLHF data quality and undermining alignment robustness. Method: We propose a robust alignment framework for noisy preferences: (i) we explicitly model instance-specific reversal probability and integrate it into the Bradley–Terry preference likelihood; (ii) we design an iterative, DPO-compatible robust optimization algorithm that jointly leverages human intent decomposition and feature-driven uncertainty estimation. Crucially, our method requires no modification to existing training pipelines. Contribution/Results: Extensive experiments under varying reversal intensities demonstrate that our approach significantly outperforms baselines in preference accuracy, training stability, and convergence speed—validating both its effectiveness and practical deployability in real-world RLHF settings.

📝 Abstract
Dataset quality plays an important role in large language model (LLM) alignment. When collecting human feedback, however, preference flipping is ubiquitous and corrupts data annotation, necessitating alignment algorithms that are robust to potentially flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning with human feedback (RLHF) perspective. We dissect preference labeling into two distinct stages: the inherent human intention model and a preference flipping mechanism introduced by external factors; for the latter, we introduce an instance-dependent flipping probability on top of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we evaluate the proposed method and baseline methods under multiple instance-dependent preference flipping settings.
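The abstract's core idea, an instance-dependent flipping probability layered on the Bradley-Terry likelihood, can be sketched as follows. The paper's exact formulation is not reproduced here; this is a minimal illustration assuming the observed label is flipped with probability `eps`, so the observed-preference likelihood is a mixture of the clean BT probability and its complement. The function name `flip_aware_bt_nll` is illustrative, not from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def flip_aware_bt_nll(margin, eps):
    """Negative log-likelihood of an observed preference label under a
    Bradley-Terry model with an instance-dependent flip probability.

    margin -- reward (or implicit-reward) difference r(y_w) - r(y_l)
    eps    -- probability that this instance's recorded label is flipped
    """
    p_clean = sigmoid(margin)                        # BT prob. of the clean label
    p_observed = (1.0 - eps) * p_clean + eps * (1.0 - p_clean)
    return -math.log(p_observed)

# eps = 0 recovers the standard BT / DPO log-loss.
standard_nll = flip_aware_bt_nll(-2.0, 0.0)
robust_nll = flip_aware_bt_nll(-2.0, 0.2)
```

Because the mixture keeps `p_observed` bounded away from zero whenever `eps > 0`, a pair whose label disagrees with the current margin (negative `margin`) incurs a bounded penalty, which is the mechanism that limits the influence of flipped annotations.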
Problem

Research questions and friction points this paper is trying to address.

Addresses data corruption from human preference flipping in LLM alignment.
Introduces a robust RLHF algorithm to handle uncertain and flipped annotations.
Models instance-dependent flipping probabilities to improve alignment algorithm robustness.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the FA-DPO algorithm for handling preference flipping.
Models an instance-dependent flipping probability on top of the Bradley-Terry model.
Uses an iterative optimization algorithm compatible with RLHF and DPO.
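The iterative optimization described above could plausibly alternate between estimating per-pair flip probabilities and minimizing a reweighted preference loss. The paper's algorithm is not given here, so the following is an EM-style sketch under assumed choices: a fixed prior flip rate `prior_eps`, a posterior E-step, and an M-step that down-weights likely-flipped pairs. All names (`estimate_flip_prob`, `iterative_step`) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def estimate_flip_prob(margin, prior_eps=0.1):
    """E-step sketch: posterior probability that a pair's label is flipped,
    given the current reward margin and a prior flip rate."""
    lik_clean = (1.0 - prior_eps) * sigmoid(margin)
    lik_flipped = prior_eps * sigmoid(-margin)
    return lik_flipped / (lik_clean + lik_flipped)

def iterative_step(margins, prior_eps=0.1):
    """One alternating update: estimate per-pair flip posteriors, then
    compute a preference loss that down-weights likely-flipped pairs."""
    eps = [estimate_flip_prob(m, prior_eps) for m in margins]
    loss = -sum((1.0 - e) * math.log(sigmoid(m))
                for e, m in zip(eps, margins)) / len(margins)
    return eps, loss
```

In this sketch, pairs whose labels contradict the current model (negative margin) receive high flip posteriors and correspondingly small weights, so a single gradient step on `loss` is dominated by pairs the model considers cleanly labeled.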