SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences

📅 2025-09-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing RLHF methods rely on a single, unified reward model, failing to capture preference heterogeneity across demographic groups and thus biasing policies toward dominant populations. MaxMin-RLHF addresses this by learning group-specific reward models and optimizing for the worst-off group, but it performs poorly when the minimum-reward group is a minority. This paper proposes SharedRep-RLHF, a framework that explicitly learns a representation shared across groups, decoupling the modeling of group differences from fairness-aware optimization; the authors prove that MaxMin-RLHF is suboptimal at learning such shared traits and quantify the sample complexity of SharedRep-RLHF. By integrating shared representation learning with multi-group preference modeling, SharedRep-RLHF alleviates the minimum-reward bottleneck inherent in worst-case optimization. Experiments across multiple NLP tasks demonstrate that SharedRep-RLHF achieves up to a 20% higher human-preference win rate than MaxMin-RLHF.
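For reference, the worst-case objective that MaxMin-RLHF optimizes can be written in the standard KL-regularized form below. The notation is generic (policy $\pi$, annotator groups $\mathcal{G}$, group reward models $r_g$, reference policy $\pi_{\mathrm{ref}}$, regularization weight $\beta$) and is not copied from the paper:

```latex
% Max-min RLHF: maximize the KL-regularized expected reward of the
% worst-off group. Generic textbook form, not the paper's exact notation.
\[
  \max_{\pi} \; \min_{g \in \mathcal{G}} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}
  \bigl[\, r_g(x, y) \,\bigr]
  \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\bigl(\pi \,\Vert\, \pi_{\mathrm{ref}}\bigr)
\]
```

SharedRep-RLHF keeps this worst-group focus but changes how the $r_g$ are parameterized, so that data from all groups contributes to the parts of the reward they have in common.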

📝 Abstract
Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art, MaxMin-RLHF, addresses this by learning group-specific reward models, and by optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify that a key limitation of MaxMin-RLHF is its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed *SharedRep-RLHF*. At its core, SharedRep-RLHF learns and leverages *shared traits* in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF with a gain of up to 20% in win rate.
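To make the "shared traits" idea concrete, here is a minimal, hypothetical sketch of a shared-representation reward model: a single encoder trained on all groups' preference data feeds lightweight group-specific heads. The names (`SharedRepRewardModel`, `bradley_terry_loss`) and the exact architecture are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRepRewardModel(nn.Module):
    """Illustrative sketch: one shared encoder captures traits common to all
    annotator groups; small per-group heads capture group-specific preferences.
    (Hypothetical architecture; the paper's parameterization may differ.)"""

    def __init__(self, embed_dim: int, hidden_dim: int, num_groups: int):
        super().__init__()
        # Shared representation, trained on preference data from ALL groups.
        self.shared_encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
        )
        # Lightweight group-specific reward heads.
        self.group_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_groups)]
        )

    def forward(self, features: torch.Tensor, group: int) -> torch.Tensor:
        # features: pooled embedding of a (prompt, response) pair.
        z = self.shared_encoder(features)
        return self.group_heads[group](z).squeeze(-1)


def bradley_terry_loss(model, feats_chosen, feats_rejected, group):
    """Standard pairwise preference (Bradley-Terry) loss for one group."""
    r_chosen = model(feats_chosen, group)
    r_rejected = model(feats_rejected, group)
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Usage sketch with random embeddings standing in for real data:
model = SharedRepRewardModel(embed_dim=768, hidden_dim=256, num_groups=2)
feats_c = torch.randn(8, 768)  # chosen-response embeddings
feats_r = torch.randn(8, 768)  # rejected-response embeddings
loss = bradley_terry_loss(model, feats_c, feats_r, group=0)
loss.backward()
```

The design intuition: because the shared encoder sees every group's comparisons, a minority group's head only has to learn a small group-specific correction, rather than an entire reward model from its scarce data.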
Problem

Research questions and friction points this paper is trying to address.

Addresses diversity in human preferences for RLHF
Improves fairness for minority groups in reward modeling
Leverages shared traits across groups to enhance performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages shared traits among annotator groups
Learns a cross-group shared representation instead of fully separate reward models
Improves performance for minority preference groups (illustrated in the sketch below)
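Continuing the sketch above, a hedged illustration of how the worst-group objective could be computed during policy optimization; under SharedRep-RLHF each group's rewards would come from its head on top of the shared encoder. This is an assumed simplification of max-min training, not the paper's algorithm:

```python
import torch

def worst_group_objective(group_rewards: list[torch.Tensor]) -> torch.Tensor:
    """Negated max-min objective for a policy-gradient step (illustrative).
    group_rewards[g] holds the rewards group g's model assigns to the
    policy's sampled responses; ascending the minimum mean reward lifts
    the worst-off group."""
    per_group = torch.stack([r.mean() for r in group_rewards])
    return -per_group.min()  # minimizing this maximizes the worst group's reward
```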
👥 Authors
Arpan Mukherjee (Imperial College London)
Marcello Bullo (Imperial College London)
Deniz Gündüz (Imperial College London)