🤖 AI Summary
This work addresses the fundamental challenge of aligning AI systems with heterogeneous human preferences, departing from the conventional assumption of a single reward function.

**Method:** We formally model preference heterogeneity via user types and prove that the average reward across types is the optimal alignment objective for a single policy. We propose a direct policy optimization framework based on user-type modeling, design loss functions under different information constraints, and theoretically analyze their convergence and sample efficiency.

**Contributions/Results:**
1. A fundamental trade-off exists between statistical consistency and sample efficiency in direct alignment.
2. First-order performance improvements are achievable using only minimal labeled feedback.
3. Full-type feedback enables consistent learning of the optimal policy, yet no direct loss in this setting is simultaneously consistent and sample-efficient.

These results establish a theoretical foundation and methodological paradigm for trustworthy alignment under preference heterogeneity.
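As a sketch of the central object (the notation below is illustrative, not taken from the paper): with user types $u$ drawn from a population distribution $\rho$ and per-type rewards $r_u$, the single-policy objective the summary describes reduces, by linearity of expectation, to optimizing one averaged reward:

```latex
% Average-reward objective for a single policy (illustrative notation):
% user types u ~ rho, per-type reward r_u, policy pi.
\max_{\pi} \; \mathbb{E}_{u \sim \rho}\,\mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r_u(x, y) \right]
  \;=\; \max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ \bar{r}(x, y) \right],
\qquad \bar{r}(x, y) := \mathbb{E}_{u \sim \rho}\!\left[ r_u(x, y) \right].
```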
📝 Abstract
Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that, for a single policy, alignment to heterogeneous preferences is best achieved by optimizing the average reward across user types; doing so, however, requires additional information about annotators. We therefore examine what improvements are possible under different information settings, focusing on direct alignment methods. We find that minimal information about annotators already yields first-order improvements, while full feedback from each user type enables consistent learning of the optimal policy. Surprisingly, however, in this latter setting no direct loss is both consistent and sample-efficient. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
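The abstract does not specify a loss. As a hedged sketch of what a type-aware direct loss could look like under full-type feedback, here is a DPO-style per-pair loss reweighted by each annotator type's population share, so the empirical objective tracks the average reward across types rather than the sampling distribution of annotators. All names (`type_weighted_dpo_loss`, `beta`, `type_weights`) and the reweighting scheme itself are illustrative assumptions, a natural baseline candidate rather than the paper's construction; the paper's results concern what such losses can and cannot achieve.

```python
# Minimal sketch (not the paper's loss): DPO-style direct loss where each
# preference pair carries its annotator's user type, and pairs are reweighted
# by that type's population share rho(u).
import torch
import torch.nn.functional as F

def type_weighted_dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi(y_w | x) under the policy
    policy_logp_rejected: torch.Tensor,  # log pi(y_l | x) under the policy
    ref_logp_chosen: torch.Tensor,       # same quantities under the reference model
    ref_logp_rejected: torch.Tensor,
    user_types: torch.Tensor,            # integer type id of each pair's annotator
    type_weights: torch.Tensor,          # population share rho(u) for each type
    beta: float = 0.1,
) -> torch.Tensor:
    """Per-pair DPO loss, reweighted toward the average reward across types."""
    logits = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    per_pair = -F.logsigmoid(logits)    # standard DPO per-pair loss
    weights = type_weights[user_types]  # rho(u) for each pair's annotator
    return (weights * per_pair).sum() / weights.sum()

# Toy usage: 4 preference pairs annotated by 2 user types.
torch.manual_seed(0)
lp = lambda: torch.randn(4)
loss = type_weighted_dpo_loss(lp(), lp(), lp(), lp(),
                              user_types=torch.tensor([0, 0, 1, 1]),
                              type_weights=torch.tensor([0.7, 0.3]))
print(loss.item())
```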