🤖 AI Summary
Standard direct preference optimization (DPO) enforces binary preferences and discards tied samples, losing useful signal. This work integrates two tie-handling extensions of the Bradley-Terry model, due to Rao and Kupper and to Davidson, into the DPO framework so that ties in pairwise comparisons are modeled probabilistically; the authors present this as the first such integration. Theoretical analysis shows that incorporating ties strengthens KL regularization and keeps the learned policy closer to the reference policy. Experiments on neural machine translation and text summarization show that augmenting training with human-annotated tied pairs yields stable or improved task performance and substantially reduced KL divergence. The method requires no architectural modifications or changes to the training pipeline, making it drop-in compatible with existing DPO implementations. Overall, the work enables more faithful use of nuanced preference signals, including ties, without added system complexity.
📝 Abstract
We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pairwise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, due to Rao and Kupper and to Davidson, that assign probability to ties as an alternative to a clear preference. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance observed when the same tied pairs are presented to standard DPO. We find empirically that including ties leads to stronger regularization with respect to the reference policy, as measured by KL divergence, and we observe this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization, as opposed to simply discarding them.
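To make the two comparison models concrete, the following is a minimal sketch of the tie probabilities they assign, using their standard parameterizations. Here `p_i` and `p_j` are the latent Bradley-Terry strengths of the two items, and the tie parameters `theta` (Rao-Kupper, with `theta >= 1`) and `nu` (Davidson, with `nu >= 0`) are illustrative names, not identifiers from the paper; this is not the paper's DPO loss, only the underlying probability models.

```python
import math

def rao_kupper_probs(p_i, p_j, theta):
    """Rao-Kupper extension of Bradley-Terry.

    theta >= 1 widens a 'tie band'; at theta = 1 the tie probability
    is zero and the model reduces to Bradley-Terry.
    Returns (P(i wins), P(j wins), P(tie)).
    """
    win_i = p_i / (p_i + theta * p_j)
    win_j = p_j / (p_j + theta * p_i)
    tie = 1.0 - win_i - win_j  # equals (theta**2 - 1) * p_i * p_j / denom
    return win_i, win_j, tie

def davidson_probs(p_i, p_j, nu):
    """Davidson extension of Bradley-Terry.

    nu >= 0 scales the tie mass, which is proportional to the
    geometric mean of the two strengths; nu = 0 recovers Bradley-Terry.
    Returns (P(i wins), P(j wins), P(tie)).
    """
    denom = p_i + p_j + nu * math.sqrt(p_i * p_j)
    return p_i / denom, p_j / denom, nu * math.sqrt(p_i * p_j) / denom
```

In both models the three outcome probabilities sum to one, so tied pairs contribute likelihood mass directly instead of being dropped, which is what allows the corresponding DPO variants to train on explicitly labeled ties.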