🤖 AI Summary
Existing reward models struggle to capture users' fine-grained preferences under data-scarce and cross-domain settings. To address this, the paper proposes PersRM-R1, the first reasoning-based personalized reward modeling framework, which identifies and models individual preference factors from only one to a few personal exemplars. Methodologically, PersRM-R1 combines synthetic data generation with a two-stage training pipeline: (i) supervised fine-tuning to learn personalized preference representations, followed by (ii) reinforcement fine-tuning to refine the decision boundary. Evaluated across multiple domains, PersRM-R1 significantly outperforms same-scale baselines in preference accuracy and cross-domain generalization, matching or exceeding the performance of substantially larger models while reducing the data and compute needed for personalized alignment.
📝 Abstract
Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. We therefore introduce PersRM-R1, the first reasoning-based reward modeling framework designed to identify and represent personal factors from only one or a few personal exemplars. To address limited data availability and the need for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline: supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
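The abstract's two-stage recipe (synthetic data expansion, then SFT, then reinforcement fine-tuning) can be sketched as a toy pipeline. This is purely illustrative: every function and field name below is a hypothetical stand-in, not the authors' implementation, and the "training" steps are counters in place of real gradient or policy updates.

```python
# Illustrative sketch of a PersRM-R1-style pipeline. All names here
# (generate_synthetic_pairs, sft_stage, rft_stage, train_persrm) are
# assumptions for exposition, not the paper's actual code.

def generate_synthetic_pairs(exemplars, n_pairs=4):
    """Expand one or a few personal exemplars into synthetic
    (preferred, dispreferred) response pairs for training."""
    pairs = []
    for i in range(n_pairs):
        ex = exemplars[i % len(exemplars)]
        pairs.append((f"{ex} [preferred variant {i}]",
                      f"{ex} [dispreferred variant {i}]"))
    return pairs

def sft_stage(model, pairs):
    """Stage 1: supervised fine-tuning on the synthetic pairs
    (stand-in: count one update per pair)."""
    for _chosen, _rejected in pairs:
        model["sft_steps"] += 1
    return model

def rft_stage(model, pairs):
    """Stage 2: reinforcement fine-tuning that refines the preference
    decision boundary (stand-in: count one policy update per pair)."""
    for _chosen, _rejected in pairs:
        model["rft_steps"] += 1
    return model

def train_persrm(exemplars):
    """Run the full pipeline in the order the abstract describes:
    synthetic data -> SFT -> reinforcement fine-tuning."""
    model = {"sft_steps": 0, "rft_steps": 0}
    pairs = generate_synthetic_pairs(exemplars)
    model = sft_stage(model, pairs)
    model = rft_stage(model, pairs)
    return model
```

The point of the sketch is only the ordering and data flow: a handful of exemplars is amplified into synthetic preference pairs, which both training stages then consume.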