🤖 AI Summary
This paper addresses alignment distortion in large language models under heterogeneous user preferences, observing that mainstream methods (e.g., RLHF, DPO) implicitly assume a single, homogeneous preference distribution and thus fail to guarantee average user utility. Drawing on social choice theory and modeling each user's comparisons with an individual Bradley–Terry (BT) model, the paper defines an alignment method's distortion as the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The analysis covers KL-divergence constraints relative to a reference policy and holds robustly over utility distributions and comparison-pair distributions. Nash Learning from Human Feedback (NLHF) achieves the minimax-optimal distortion of $(\frac{1}{2} + o(1)) \cdot \eta$ (where $\eta$ is the BT temperature), whereas RLHF and DPO suffer distortion at least $(1 - o(1)) \cdot \eta$ even without a KL constraint, and $e^{\Omega(\eta)}$ or even unbounded distortion in the full setting, exposing their fundamental limitations under diverse preferences.
📝 Abstract
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average -- a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \eta$ (for the BT temperature $\eta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \eta$ distortion already without a KL constraint, and $e^{\Omega(\eta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
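To make the two central definitions concrete, here is a minimal sketch of (a) an individual Bradley–Terry comparison probability with temperature $\eta$ and (b) the distortion ratio, on a hypothetical toy population of three users and two candidate responses. The utilities, the sigmoid parameterization of the BT model, and the "follow the population preference" policy are all illustrative assumptions, not constructions from the paper.

```python
import math

# Hypothetical setup: two responses (a, b) and three users, each with
# their own utilities for the two responses.
users = [(1.0, 0.0), (0.2, 0.8), (0.1, 0.9)]  # (u(a), u(b)) per user

ETA = 2.0  # BT temperature (assumed sigmoid parameterization below)

def bt_prob(u_a, u_b, eta=ETA):
    """P(user prefers a over b) under an individual Bradley-Terry model:
    a logistic function of the utility difference, scaled by eta."""
    return 1.0 / (1.0 + math.exp(-eta * (u_a - u_b)))

def avg_utility(p_a):
    """Average user utility of a policy that outputs a w.p. p_a, else b."""
    return sum(p_a * ua + (1.0 - p_a) * ub for ua, ub in users) / len(users)

# Optimal achievable average utility (best deterministic choice here).
opt = max(avg_utility(1.0), avg_utility(0.0))

# A stand-in "learned" policy: sample a with probability equal to the
# population-average preference for a (aggregated pairwise comparisons).
p_pop = sum(bt_prob(ua, ub) for ua, ub in users) / len(users)
learned = avg_utility(p_pop)

# Distortion: optimal average utility over the learned policy's average
# utility. It is >= 1, and larger values mean worse alignment.
distortion = opt / learned
```

In this toy instance the majority of users prefer response b, so the optimal policy outputs b, while aggregating BT comparisons puts substantial probability on a and loses average utility; the ratio of the two quantities is exactly the distortion being bounded in the paper's theorems.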