PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reward models struggle to capture users' fine-grained preferences under data-scarce and cross-domain settings. To address this, we propose PersRM-R1, the first reasoning-based personalized reward modeling framework, which identifies and models individual preference factors from only one to three user feedback exemplars. Methodologically, PersRM-R1 employs a synthetic-data-augmented two-stage training pipeline: (i) supervised fine-tuning to learn personalized preference representations, followed by (ii) reinforcement fine-tuning to refine the decision boundary. Evaluated across multiple domains, PersRM-R1 significantly outperforms same-scale baselines in preference accuracy and cross-domain generalization, matching or exceeding the performance of substantially larger models while drastically reducing both the data and compute required for personalized alignment.

📝 Abstract
Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. Thus, we introduce PersRM-R1, the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. To address challenges including limited data availability and the requirement for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
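The abstract describes the pipeline only at a high level. For orientation, reward models over pairwise preference data are conventionally trained with a Bradley–Terry objective during the supervised stage; the minimal PyTorch sketch below shows that generic loss (the function name and toy scores are illustrative assumptions, not PersRM-R1's code, and it omits the personal exemplars and reasoning trace the paper conditions on).

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_scores: torch.Tensor,
                             rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: drive the reward assigned to the
    preferred (chosen) response above the rejected one. Shapes: (batch,)."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with hand-picked scores for three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.9, 1.1])
print(pairwise_preference_loss(chosen, rejected))  # scalar loss
```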
Problem

Research questions and friction points this paper is trying to address.

Capturing nuanced, user-specific preferences from limited data
Identifying personal preference factors from only one or a few exemplars
Generalizing reward models across diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-based reward modeling framework for personalization
Synthetic data generation to compensate for scarce personal feedback
Two-stage training: supervised fine-tuning followed by reinforcement fine-tuning (see the sketch below)
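The paper's reinforcement fine-tuning reward is not given here; a common R1-style choice is a verifiable correctness reward on the model's final verdict. Below is a hedged sketch of that pattern, assuming the reward model emits its choice between two candidate responses in an <answer>A</answer> / <answer>B</answer> tag; the tag format and function name are hypothetical, not taken from the paper.

```python
import re

def verdict_reward(completion: str, preferred: str) -> float:
    """Hypothetical RFT reward: 1.0 if the model's final <answer> tag
    names the labeled preferred response ('A' or 'B'), else 0.0."""
    match = re.search(r"<answer>\s*([AB])\s*</answer>", completion)
    if match is None:
        return 0.0  # malformed or missing verdict earns no reward
    return 1.0 if match.group(1) == preferred else 0.0

# Toy usage
print(verdict_reward("...reasoning trace...<answer>A</answer>", "A"))  # 1.0
print(verdict_reward("no verdict emitted", "B"))                       # 0.0
```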
👥 Authors

Mengdi Li
King Abdullah University of Science and Technology
Reinforcement Learning · LLMs · Robotics

Guanqiao Chen
University of Science and Technology of China

Xufeng Zhao
University of Hamburg

Haochen Wen
University College London
Large Language Models · XAI · AI safety · Reinforcement Learning

Shu Yang
Provable Responsible AI and Data Analytics Lab, King Abdullah University of Science and Technology

Di Wang
Provable Responsible AI and Data Analytics Lab, King Abdullah University of Science and Technology