🤖 AI Summary
Instance-dependent preference reversal—where human annotators inconsistently rank the same response pair across different contexts—is prevalent in preference labeling, severely degrading RLHF data quality and undermining alignment robustness. Method: We propose a robust alignment framework for noisy preferences: (i) we explicitly model instance-specific reversal probability and integrate it into the Bradley–Terry preference likelihood; (ii) we design an iterative, DPO-compatible robust optimization algorithm that jointly leverages human intent decomposition and feature-driven uncertainty estimation. Crucially, our method requires no modification to existing training pipelines. Contribution/Results: Extensive experiments under varying reversal intensities demonstrate that our approach significantly outperforms baselines in preference accuracy, training stability, and convergence speed—validating both its effectiveness and practical deployability in real-world RLHF settings.
📝 Abstract
Dataset quality plays an important role in large language model (LLM) alignment. When collecting human feedback, however, preference flipping is ubiquitous and corrupts annotations; this calls for alignment algorithms that are robust to potentially flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We decompose annotation into two distinct stages: an inherent human intention model and a preference-flipping mechanism driven by external factors; for the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture judgment uncertainty and model preference-flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we evaluate our proposed method alongside baseline methods under multiple instance-dependent preference-flipping settings.
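To make the flipping model concrete, here is a minimal sketch of how an instance-dependent flipping probability can be folded into the BT preference likelihood. This is an illustrative reconstruction, not the paper's implementation: the function names and the scalar `eps` (the per-instance flipping probability, which the paper estimates from annotation-relevant features) are assumptions for the example.

```python
import math

def bt_prob(r_w, r_l):
    """Standard Bradley-Terry probability that the 'winning' response
    is preferred, given scalar rewards r_w and r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def flipped_bt_prob(r_w, r_l, eps):
    """Observed preference probability when the label is reversed with
    instance-dependent probability eps in [0, 0.5): a mixture of the
    clean BT probability and its complement."""
    p = bt_prob(r_w, r_l)
    return (1.0 - eps) * p + eps * (1.0 - p)

def flipping_aware_nll(r_w, r_l, eps):
    """Negative log-likelihood of the observed label under the mixture,
    a flipping-aware analogue of the BT/DPO loss term."""
    return -math.log(flipped_bt_prob(r_w, r_l, eps))
```

With `eps = 0` this reduces to the ordinary BT likelihood; as `eps` grows, the observed probability is pulled toward 0.5, so confidently flipped pairs contribute a bounded rather than exploding loss.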