Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

📅 2026-06-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Most existing preference modeling approaches assume a single, universal reward function, so they struggle to capture the diversity and heterogeneity of human preferences and are hard to personalize. To address this, the authors propose a sparse Mixture-of-Experts (MoE) reward model trained on binary (pairwise) preference data. Training encourages both sparse routing and expert diversity, so that each expert specializes in a distinct preference pattern rather than overlapping with the others. The resulting experts are semantically coherent, and the routing weights shift after adaptation in ways that show how the model adjusts to an individual user's preferences. In controlled and real-world experiments, the model learns interpretable routing patterns and specialized experts and improves test-time personalization.
πŸ“ Abstract
Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
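To make the idea in the abstract concrete, here is a minimal sketch of a sparse MoE reward model scored on one binary preference pair. It is an illustration under stated assumptions, not the authors' implementation: the linear experts, the top-k softmax gate, the choice to route on the context only, the Bradley-Terry pairwise loss, and the cosine-based diversity penalty are all hypothetical choices, as are the names `top_k_gate`, `moe_reward`, `preference_loss`, and `diversity_penalty`. The abstract does not specify any of these details.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Keep the k largest router logits, softmax over them, zero the rest."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in keep])
    gate = [0.0] * len(logits)
    for i, p in zip(keep, probs):
        gate[i] = p
    return gate


def moe_reward(context, response, router_w, expert_w, k):
    """Reward is the gate-weighted sum of linear expert scores.

    Assumption: the router sees only the context features, while each
    expert scores the response features.
    """
    gate = top_k_gate([dot(w, context) for w in router_w], k)
    reward = sum(g * dot(w, response) for g, w in zip(gate, expert_w) if g > 0.0)
    return reward, gate


def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def diversity_penalty(expert_w):
    """Mean squared cosine similarity over expert pairs (hypothetical regularizer)."""
    n = len(expert_w)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(expert_w[i], expert_w[j]) ** 2 for i, j in pairs) / len(pairs)


# Toy example: three experts, two of which are active per input.
context = [1.0, 0.0]
chosen, rejected = [1.0, 0.0], [0.0, 1.0]
router_w = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
expert_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

r_chosen, gate = moe_reward(context, chosen, router_w, expert_w, k=2)
r_rejected, _ = moe_reward(context, rejected, router_w, expert_w, k=2)
loss = preference_loss(r_chosen, r_rejected) + 0.1 * diversity_penalty(expert_w)
print(gate, loss)
```

In this sketch the gate vector is the interpretable part: it is zero for unselected experts, so inspecting which entries are nonzero for a given user or prompt, and how they move after adaptation, is one way to read the kind of "post-adaptation shifts in expert weights" the abstract mentions. The 0.1 weight on the penalty is an arbitrary illustrative value.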
Problem

Research questions and friction points this paper is trying to address.

preference modeling
personalization
reward models
human feedback
Mixture-of-Experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture-of-Experts
Preference Modeling
Interpretable Experts
Personalized RLHF
Expert Diversity