Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of standard Direct Preference Optimization (DPO): it relies on pairwise preferences and does not directly handle the set-wise, multi-positive feedback common in recommender systems. The authors propose Mult-DPO, an extension of DPO to set-wise preferences that builds a surrogate likelihood from a multinomial distribution, so that multiple candidates are aligned directly in the reward-induced weight space. The result is a closed-form, classification-style objective that is proven to be a tractable upper bound on the marginalized Plackett-Luce loss. The analysis shows that the tightness of this bound is governed by the total weight of the positive samples relative to the negative samples, which motivates adding richer or harder negatives to tighten it. Experiments are reported to show that Mult-DPO effectively aligns large language models for recommendation tasks, with performance gains over the compared approaches.
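The upper-bound claim can be illustrated under one simple reading of the construction. The notation below is an assumption made for this sketch, not the paper's own formulation: each candidate $i$ has a weight $w_i = \exp(r_i)$ from its implicit reward $r_i$, $P$ is the set of $K$ positives, $N$ is the set of negatives, $W_P$ and $W_N$ are their total weights, and $W = W_P + W_N$. The surrogate $p_{\mathrm{Mult}}$ is taken to be the multinomial probability of drawing each positive exactly once in $K$ draws.

```latex
\begin{aligned}
p_{\mathrm{PL}}(P \succ N)
  &= \sum_{\sigma \in S_K} \prod_{k=1}^{K}
     \frac{w_{\sigma(k)}}{W - \sum_{j<k} w_{\sigma(j)}}
  \;\ge\; \sum_{\sigma \in S_K} \prod_{k=1}^{K} \frac{w_{\sigma(k)}}{W}
  \;=\; K! \prod_{i \in P} \frac{w_i}{W}
  \;=:\; p_{\mathrm{Mult}}(P), \\[4pt]
0 \;\le\; \log \frac{p_{\mathrm{PL}}(P \succ N)}{p_{\mathrm{Mult}}(P)}
  &\le (K-1)\,\log\!\left(1 + \frac{W_P}{W_N}\right).
\end{aligned}
```

The first line holds because every Plackett-Luce denominator is at most $W$, so $-\log p_{\mathrm{Mult}} \ge -\log p_{\mathrm{PL}}$. The second line follows because the first denominator equals $W$ and each later one still exceeds $W_N$. Under this reading, the gap shrinks as the negatives carry more total weight relative to the positives, which is consistent with the summary's remark about richer or harder negatives.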

📝 Abstract

Direct preference optimization (DPO) is a simple and effective alignment strategy for large language models (LLMs) based on pairwise preferences. In recommender systems, however, user feedback is rarely pairwise. For a given context, e.g., a user, a session, or a conversation, we typically observe set-wise preferences with multiple positive items, where every positive item should outrank every unobserved or explicitly negative item, with no prescribed order among the positives or the negatives themselves. A natural generalization is to use the Plackett-Luce (PL) reward model, which extends the Bradley-Terry reward model underlying vanilla DPO from pairwise preferences to full rankings of candidates. However, we show that adapting the PL model to set-wise preferences requires marginalizing over all positive orderings, where the resulting expression is combinatorial in complexity. To address this fundamental challenge, we propose Mult-DPO, a novel DPO objective with a tractable multinomial surrogate likelihood over set-wise preference events for the user-preference alignment of LLM-based recommender systems. The multinomial construction is not itself a ranking distribution, but it is defined on the same reward-induced weight space and admits a closed-form DPO-style objective, enabling direct alignment of LLMs with multiple candidates through a classification-style objective. In addition, we prove that the multinomial DPO loss is a tractable upper bound on the marginalized PL DPO loss when optimizing against the set-wise preference data. We further characterize the tightness of this bound in terms of the relative total weight of positives versus negatives, which provides insights into tightening the bound with richer or harder negatives. Finally, we extend Mult-DPO to the alignment of LLMs with multiple preference levels. Code is available at https://github.com/yaochenzhu/Mult_DPO
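The relationship between the two losses can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): weights are taken as `exp(beta * log_ratio)` in the usual DPO style, the marginalized PL probability is computed by brute force over all orderings of the positives, and the multinomial surrogate is assumed to be the probability of drawing each positive exactly once in K draws. All function names are hypothetical.

```python
import itertools
import math


def weights_from_log_ratios(log_ratios, beta=1.0):
    # Assumed DPO-style implicit reward: r = beta * (log pi_theta - log pi_ref),
    # mapped to a positive weight w = exp(r).
    return [math.exp(beta * lr) for lr in log_ratios]


def pl_marginal_loss(pos_w, neg_w):
    # Negative log-probability, under Plackett-Luce, that every positive is
    # ranked above every negative, summed over all orderings of the positives.
    # Cost grows as K! in the number of positives K.
    total = sum(pos_w) + sum(neg_w)
    prob = 0.0
    for order in itertools.permutations(pos_w):
        remaining = total
        p = 1.0
        for w in order:
            p *= w / remaining
            remaining -= w  # a chosen positive leaves the candidate pool
        prob += p
    return -math.log(prob)


def multinomial_loss(pos_w, neg_w):
    # Assumed surrogate: K draws with replacement from the normalized weights,
    # each positive drawn exactly once. Cost is linear in the candidate count.
    total = sum(pos_w) + sum(neg_w)
    log_prob = math.log(math.factorial(len(pos_w)))  # constant in the weights
    log_prob += sum(math.log(w / total) for w in pos_w)
    return -log_prob


pos_w = weights_from_log_ratios([0.7, 0.0])       # two positives
few_neg = weights_from_log_ratios([0.0])          # one negative
many_neg = weights_from_log_ratios([0.0] * 10)    # ten negatives

gap_few = multinomial_loss(pos_w, few_neg) - pl_marginal_loss(pos_w, few_neg)
gap_many = multinomial_loss(pos_w, many_neg) - pl_marginal_loss(pos_w, many_neg)
print(f"gap with 1 negative:   {gap_few:.4f}")
print(f"gap with 10 negatives: {gap_many:.4f}")
```

In this toy setting the surrogate loss is never below the brute-force PL loss, and the gap narrows when more negative weight is added. The brute-force function is only practical for a handful of positives, which illustrates why a tractable surrogate is attractive.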

Problem

Research questions and friction points this paper is trying to address.

set-wise preferences

direct preference optimization

recommender systems

Plackett-Luce model

multinomial surrogate

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multinomial Direct Preference Optimization

Set-wise Preference

Plackett-Luce Model