🤖 AI Summary
Standard direct preference optimization (DPO) enforces binary preferences and discards tied samples, losing useful signal. This work integrates two tie-handling extensions of the Bradley-Terry model, due to Rao and Kupper and to Davidson, into the DPO framework so that ties in pairwise comparisons are modeled probabilistically; the authors present this as the first such integration. Theoretical analysis shows that incorporating ties strengthens KL regularization and keeps the learned policy closer to the reference policy. Experiments on neural machine translation and text summarization show that augmenting training with human-annotated tied pairs yields stable or improved task performance and substantially reduced KL divergence. The method requires no architectural modifications or changes to the training pipeline, making it drop-in compatible with existing DPO implementations. Overall, the work enables more faithful use of nuanced preference signals, including ties, without added system complexity.
📝 Abstract
We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pairwise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, due to Rao and Kupper and to Davidson, that assign probability to ties as an alternative to a clear preference. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance observed when the same tied pairs are presented to standard DPO. We find empirically that including ties leads to stronger regularization with respect to the reference policy, as measured by KL divergence, and we observe this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization, as opposed to simply discarding them.
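To make the two comparison models concrete, the following is a minimal sketch of the tie probabilities they assign, using their standard parameterizations. Here `p_i` and `p_j` are the latent Bradley-Terry strengths of the two items, and the tie parameters `theta` (Rao-Kupper, with `theta >= 1`) and `nu` (Davidson, with `nu >= 0`) are illustrative names, not identifiers from the paper; this is not the paper's DPO loss, only the underlying probability models.

```python
import math

def rao_kupper_probs(p_i, p_j, theta):
    """Rao-Kupper extension of Bradley-Terry.

    theta >= 1 widens a 'tie band'; at theta = 1 the tie probability
    is zero and the model reduces to Bradley-Terry.
    Returns (P(i wins), P(j wins), P(tie)).
    """
    win_i = p_i / (p_i + theta * p_j)
    win_j = p_j / (p_j + theta * p_i)
    tie = 1.0 - win_i - win_j  # equals (theta**2 - 1) * p_i * p_j / denom
    return win_i, win_j, tie

def davidson_probs(p_i, p_j, nu):
    """Davidson extension of Bradley-Terry.

    nu >= 0 scales the tie mass, which is proportional to the
    geometric mean of the two strengths; nu = 0 recovers Bradley-Terry.
    Returns (P(i wins), P(j wins), P(tie)).
    """
    denom = p_i + p_j + nu * math.sqrt(p_i * p_j)
    return p_i / denom, p_j / denom, nu * math.sqrt(p_i * p_j) / denom
```

In both models the three outcome probabilities sum to one, so tied pairs contribute likelihood mass directly instead of being dropped, which is what allows the corresponding DPO variants to train on explicitly labeled ties.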