Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses alignment distortion in large language models under heterogeneous user preferences, observing that mainstream methods (e.g., RLHF, DPO) implicitly assume a single, homogeneous preference model and thus fail to guarantee average user utility. Drawing on social choice theory, the authors define an alignment method's *distortion* as the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. Modeling each user's comparisons with an individual Bradley–Terry (BT) model, they show that Nash Learning from Human Feedback (NLHF), which combines equilibrium selection in a preference game with KL-divergence regularization, achieves the minimax-optimal distortion of \( (\frac{1}{2} + o(1)) \cdot \eta \) for BT temperature \( \eta \). RLHF and DPO, by contrast, suffer distortion at least \( (1 - o(1)) \cdot \eta \), and \( e^{\Omega(\eta)} \) or even unbounded distortion depending on how comparison pairs are sampled, exposing their fundamental limitations in multi-preference settings.

📝 Abstract
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average -- a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility, and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \eta$ (for the BT temperature $\eta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \eta$ distortion already without a KL constraint, and $e^{\Omega(\eta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
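The two quantities the abstract combines -- individual Bradley-Terry comparison probabilities and a policy's distortion -- can be sketched on a toy instance. Everything below (the number of users and responses, the utility matrix, the uniform "learned" policy) is an illustrative assumption, not data from the paper:

```python
# Toy sketch: Bradley-Terry comparison probabilities and the distortion of a policy.
import numpy as np

eta = 5.0  # BT temperature (the paper's eta); illustrative value

# Hypothetical setting: 3 users, 4 candidate responses; u[i, y] = utility of response y to user i.
u = np.array([
    [1.0, 0.2, 0.1, 0.0],
    [0.0, 0.9, 0.3, 0.1],
    [0.1, 0.1, 0.8, 0.7],
])

def bt_prob(u_i, y, y_prime):
    """P(user i prefers y over y') under that user's individual Bradley-Terry model."""
    return 1.0 / (1.0 + np.exp(-eta * (u_i[y] - u_i[y_prime])))

def avg_utility(policy):
    """Average (over users) expected utility of a distribution over responses."""
    return float(np.mean(u @ policy))

# Distortion of a learned policy: optimal achievable average utility divided by
# the average utility the policy attains (taken worst-case over instances in the paper).
best = max(avg_utility(np.eye(4)[y]) for y in range(4))
learned = np.full(4, 0.25)          # e.g. a uniform policy stands in for the aligned model
distortion = best / avg_utility(learned)
```

A distortion of 1 would mean the learned policy is average-utility optimal on this instance; the paper's bounds characterize how large this ratio can get in the worst case for each alignment method.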
Problem

Research questions and friction points this paper is trying to address.

Evaluates distortion in AI alignment methods for diverse user preferences
Compares RLHF and DPO performance under worst-case utility ratios
Shows Nash Learning from Human Feedback achieves minimax-optimal distortion across utility distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes Nash Learning from Human Feedback (NLHF) for minimax-optimal distortion
Models user preferences with Bradley-Terry
Compares RLHF and DPO distortion rates
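NLHF targets the symmetric Nash equilibrium of a game whose payoff is the population preference probability $P(y \succ y')$. A minimal self-play sketch of that equilibrium computation, omitting the KL-regularization term the full method uses, and with made-up population "strengths" generating the preference matrix:

```python
# Toy sketch: Nash equilibrium of a Bradley-Terry preference game via
# self-play exponentiated gradient (no KL regularization; all values illustrative).
import numpy as np

eta = 5.0
rng = np.random.default_rng(0)
scores = rng.normal(size=5)  # hypothetical population-level strengths of 5 responses

# Pairwise preference matrix: P[y, y'] = P(y preferred over y'); note P + P.T == 1.
P = 1.0 / (1.0 + np.exp(-eta * (scores[:, None] - scores[None, :])))

pi = np.full(5, 0.2)  # start from the uniform policy
lr = 0.5
for _ in range(2000):
    payoff = P @ pi                              # win-rate of each response vs current pi
    pi = pi * np.exp(lr * (payoff - payoff @ pi))  # multiplicative-weights self-play step
    pi /= pi.sum()
```

Because a BT-generated preference matrix is transitive, the unregularized equilibrium here collapses onto the single strongest response; the KL term in the full NLHF objective keeps the equilibrium policy close to the reference model instead.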