Learning Correlated Reward Models: Statistical Barriers and Opportunities

📅 2025-10-17
🤖 AI Summary
Problem: The Independence of Irrelevant Alternatives (IIA) assumption, widely adopted in Random Utility Models (RUMs), yields coarse preference representations that cannot capture correlation among latent utilities. Method: This work studies the correlated probit RUM, which avoids IIA, and analyzes the statistical and computational challenges of learning it from preference data. Contribution/Results: The paper shows that pairwise preference data are fundamentally insufficient for learning utility correlations, whereas best-of-three comparisons overcome this limitation. Building on this, it devises a statistically and computationally efficient estimator with near-optimal performance. Experiments on several real-world datasets demonstrate improved personalization of human preferences.

📝 Abstract
Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses all human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a correlated probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is fundamentally insufficient to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that best-of-three preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
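
The contrast between pairwise and best-of-three data can be illustrated with a small simulation. The sketch below is not the paper's estimator or its exact parameterization. It assumes a three-item probit RUM whose latent utilities are jointly Gaussian with zero means and unit variances, and the function names (`sample_utilities`, `pairwise_win_rate`, `best_of_three_rates`) are hypothetical. Under these assumptions every pairwise win rate is 1/2 whatever the correlation, while the best-of-three frequencies shift once two items are correlated.

```python
import numpy as np

def sample_utilities(cov, n_samples, seed=0):
    """Draw latent utilities from a zero-mean Gaussian (assumed probit RUM)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(cov.shape[0])
    return rng.multivariate_normal(mean, cov, size=n_samples)

def pairwise_win_rate(utils, i, j):
    """Empirical frequency with which item i is preferred to item j."""
    return float(np.mean(utils[:, i] > utils[:, j]))

def best_of_three_rates(utils):
    """Empirical frequency with which each item is chosen as the best."""
    winners = np.argmax(utils, axis=1)
    return np.bincount(winners, minlength=utils.shape[1]) / len(utils)

# Two illustrative models: independent utilities, and items 0 and 1
# strongly correlated (0.9) while item 2 stays independent.
cov_independent = np.eye(3)
cov_correlated = np.array([[1.0, 0.9, 0.0],
                           [0.9, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])

n = 200_000
utils_ind = sample_utilities(cov_independent, n)
utils_cor = sample_utilities(cov_correlated, n)

# Pairwise data: both models give win rates close to 1/2 for every pair.
pairs = [(0, 1), (0, 2), (1, 2)]
pairwise_ind = [pairwise_win_rate(utils_ind, i, j) for i, j in pairs]
pairwise_cor = [pairwise_win_rate(utils_cor, i, j) for i, j in pairs]

# Best-of-three data: item 2 wins more often when items 0 and 1 are correlated.
bo3_ind = best_of_three_rates(utils_ind)
bo3_cor = best_of_three_rates(utils_cor)

print("pairwise (independent):", pairwise_ind)
print("pairwise (correlated): ", pairwise_cor)
print("best-of-three (independent):", bo3_ind)
print("best-of-three (correlated): ", bo3_cor)
```

In this toy setting the two covariance matrices cannot be told apart from pairwise comparisons alone, yet the frequency with which item 2 wins a three-way comparison differs visibly between them. This mirrors, in a very simplified form, the role the abstract assigns to best-of-three data.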
Problem

Research questions and friction points this paper is trying to address.

Learning correlated reward models without IIA assumption limitations
Overcoming statistical barriers in human preference modeling
Developing efficient estimators using higher-order preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses correlated probit model to avoid IIA assumption
Employs best-of-three preference data for learning correlations
Develops efficient estimator for fine-grained preference modeling