Democratic Preference Alignment via Sortition-Weighted RLHF

📅 2026-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited demographic representativeness of current preference-based AI alignment methods, which often rely on convenience samples of raters and therefore fail to capture diverse public values. It proposes Democratic Preference Optimization (DemPO), a framework that integrates sortition, the random selection mechanism used to form citizens’ assemblies, into reinforcement learning from human feedback (RLHF). DemPO explores two fine-tuning strategies: a “hard panel” that trains only on preferences from a quota-satisfying, sortition-selected panel of raters, and a “soft panel” that keeps all raters but reweights each one by their inclusion probability under the sortition lottery. The authors prove that soft-panel weighting recovers the expected hard-panel objective in closed form. Experiments on Llama models (1B–8B parameters), evaluated across six aggregation methods, show that the hard panel consistently ranks first and the soft panel consistently outperforms the unweighted baseline, with effect sizes growing at larger model scales. The results support enforcing representativeness when preferences are collected rather than correcting for it afterward.

📝 Abstract

Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically overrepresent some demographics and underrepresent others. We introduce Democratic Preference Optimization, or DemPO, a framework that applies algorithmic sortition, the same mechanism used to construct citizens' assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy-five-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than through post hoc correction, yields models whose behavior better reflects values elicited from representative publics.
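
The relationship between the two schemes can be illustrated with a small sketch. It is not the paper's implementation: the rater pool, the groups, the per-rater losses, the quota, and the uniform lottery over quota-satisfying panels are all hypothetical assumptions chosen so the arithmetic can be checked exactly. Under those assumptions, it enumerates every feasible panel, computes each rater's inclusion probability, and compares the expected Hard Panel loss with the Soft Panel loss obtained by weighting each rater by that probability.

```python
import itertools
from fractions import Fraction

# Hypothetical toy pool: (rater_id, demographic_group, per-rater loss).
# All values are illustrative assumptions, not data from the paper.
pool = [
    ("r1", "A", Fraction(1)),
    ("r2", "A", Fraction(3)),
    ("r3", "A", Fraction(5)),
    ("r4", "B", Fraction(2)),
]

# Assumed quota: the panel must contain exactly one rater from each group.
quota = {"A": 1, "B": 1}


def feasible_panels(pool, quota):
    """Enumerate every panel that satisfies the quota exactly."""
    size = sum(quota.values())
    panels = []
    for combo in itertools.combinations(pool, size):
        counts = {}
        for _, group, _ in combo:
            counts[group] = counts.get(group, 0) + 1
        if counts == quota:
            panels.append(combo)
    return panels


def inclusion_probabilities(panels):
    """Probability that each rater is selected, assuming a uniform
    lottery over the feasible panels (an assumption of this sketch)."""
    p_panel = Fraction(1, len(panels))
    probs = {}
    for panel in panels:
        for rater_id, _, _ in panel:
            probs[rater_id] = probs.get(rater_id, Fraction(0)) + p_panel
    return probs


panels = feasible_panels(pool, quota)
pi = inclusion_probabilities(panels)

# Expected Hard Panel loss: average, over the lottery, of the summed
# losses of the raters on the drawn panel.
expected_hard = sum(
    sum(loss for _, _, loss in panel) for panel in panels
) / len(panels)

# Soft Panel loss: every rater is kept, weighted by inclusion probability.
soft = sum(pi.get(rater_id, Fraction(0)) * loss for rater_id, _, loss in pool)

print(expected_hard, soft)
```

The two quantities agree because of linearity of expectation: the expected sum of losses over a random panel equals the sum of each rater's loss times the probability that the rater is on the panel. This holds for any distribution over panels, so the uniform lottery here is only a convenience; practical sortition algorithms may choose a different distribution.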

Problem

Research questions and friction points this paper is trying to address.

preference alignment

demographic representativeness

RLHF

sortition

value alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Democratic Preference Optimization

algorithmic sortition

representative sampling