🤖 AI Summary
Standard RLHF compresses diverse human preferences into a single reward signal, overlooking the fact that several valid responses can coexist in structurally plural societies and thereby distorting the measurement of alignment. This work introduces the concept of “Preference-Validity Compression,” using Malaysia’s multicultural context as a diagnostic case, and argues that alignment methods should satisfy “Validity-Preserving Consistency.” Analyzing 321 preference events from 20 participants over 107 trio-annotated prompts, each event linking a prompt, its responses, and acceptability judgments, the study finds that 79% of prompts contain more than one majority-supported response. Once all such responses are considered, the apparent dominance gap between top responses shrinks, indicating that single-winner aggregation introduces a measurement bias in pluralistic settings.
📝 Abstract
Standard RLHF pipelines often reduce heterogeneous human judgments to a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.
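The contrast between single-winner aggregation and keeping every majority-supported response can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' pipeline: the event layout `(prompt_id, annotator_id, accepted_responses)`, the function names, and the reading of "majority-supported" as accepted by strictly more than half of a prompt's annotators. The toy events are invented and are not data from the study.

```python
from collections import Counter, defaultdict


def acceptance_rates(events):
    """Per prompt, the fraction of annotators who accepted each response.

    `events` is an iterable of (prompt_id, annotator_id, accepted_responses);
    this layout is a hypothetical stand-in for the paper's preference events.
    """
    counts = defaultdict(Counter)
    annotators = defaultdict(set)
    for prompt_id, annotator_id, accepted in events:
        annotators[prompt_id].add(annotator_id)
        counts[prompt_id].update(accepted)
    return {
        p: {r: c / len(annotators[p]) for r, c in counts[p].items()}
        for p in annotators
    }


def single_winner(rates):
    """Argmax aggregation: keep only the most-accepted response per prompt.

    Ties are broken by response id so the result is deterministic.
    """
    return {p: max(sorted(r), key=r.get) for p, r in rates.items() if r}


def majority_supported(rates, threshold=0.5):
    """Keep every response accepted by more than `threshold` of annotators.

    The strict-majority threshold is an assumption, not taken from the paper.
    """
    return {
        p: {resp for resp, rate in r.items() if rate > threshold}
        for p, r in rates.items()
    }


def share_multi_valid(rates, threshold=0.5):
    """Fraction of prompts with more than one majority-supported response."""
    sets = majority_supported(rates, threshold)
    return sum(len(s) > 1 for s in sets.values()) / len(sets)


# Invented toy events: two prompts, three annotators each.
toy_events = [
    ("p1", "a1", {"A", "B"}),
    ("p1", "a2", {"A", "B"}),
    ("p1", "a3", {"A"}),
    ("p2", "a1", {"A"}),
    ("p2", "a2", {"A"}),
    ("p2", "a3", {"C"}),
]
toy_rates = acceptance_rates(toy_events)
```

On the toy data, prompt `p1` has two responses that clear the majority threshold, yet `single_winner` retains only one of them. Under these assumptions, `share_multi_valid` is the kind of statistic behind a figure such as the reported 79%, and comparing the acceptance rates of the retained and discarded responses gives a rough view of the dominance gap.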