🤖 AI Summary
Existing reward models struggle to capture users' fine-grained preferences under data-scarce and cross-domain settings. To address this, the paper proposes PersRM-R1, the first reasoning-based personalized reward modeling framework, which identifies and models individual preference factors from only one to a few personal exemplars. Methodologically, PersRM-R1 combines synthetic data generation with a two-stage training pipeline: (i) supervised fine-tuning to learn personalized preference representations, followed by (ii) reinforcement fine-tuning to refine the decision boundary. Evaluated across multiple domains, PersRM-R1 significantly outperforms same-scale baselines in preference accuracy and cross-domain generalization, matching or exceeding the performance of substantially larger models while reducing the data and compute needed for personalized alignment.
📝 Abstract
Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. We therefore introduce PersRM-R1, the first reasoning-based reward modeling framework designed to identify and represent personal factors from only one or a few personal exemplars. To address limited data availability and the need for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline: supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
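The abstract's two-stage recipe (synthetic data expansion, then SFT, then reinforcement fine-tuning) can be sketched as a toy pipeline. This is purely illustrative: every function and field name below is a hypothetical stand-in, not the authors' implementation, and the "training" steps are counters in place of real gradient or policy updates.

```python
# Illustrative sketch of a PersRM-R1-style pipeline. All names here
# (generate_synthetic_pairs, sft_stage, rft_stage, train_persrm) are
# assumptions for exposition, not the paper's actual code.

def generate_synthetic_pairs(exemplars, n_pairs=4):
    """Expand one or a few personal exemplars into synthetic
    (preferred, dispreferred) response pairs for training."""
    pairs = []
    for i in range(n_pairs):
        ex = exemplars[i % len(exemplars)]
        pairs.append((f"{ex} [preferred variant {i}]",
                      f"{ex} [dispreferred variant {i}]"))
    return pairs

def sft_stage(model, pairs):
    """Stage 1: supervised fine-tuning on the synthetic pairs
    (stand-in: count one update per pair)."""
    for _chosen, _rejected in pairs:
        model["sft_steps"] += 1
    return model

def rft_stage(model, pairs):
    """Stage 2: reinforcement fine-tuning that refines the preference
    decision boundary (stand-in: count one policy update per pair)."""
    for _chosen, _rejected in pairs:
        model["rft_steps"] += 1
    return model

def train_persrm(exemplars):
    """Run the full pipeline in the order the abstract describes:
    synthetic data -> SFT -> reinforcement fine-tuning."""
    model = {"sft_steps": 0, "rft_steps": 0}
    pairs = generate_synthetic_pairs(exemplars)
    model = sft_stage(model, pairs)
    model = rft_stage(model, pairs)
    return model
```

The point of the sketch is only the ordering and data flow: a handful of exemplars is amplified into synthetic preference pairs, which both training stages then consume.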