Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

📅 2026-06-02

🤖 AI Summary
Existing reward models draw on heterogeneous evaluation criteria, such as rule-based verifiers, reference answers, and scoring rubrics, but lack a unified mechanism for integrating them. The authors reformulate reward modeling as the execution of a reusable “Reward-Evaluation Skill”: a structured agent dynamically selects and aggregates evidence from multiple sources according to the requirements of each input, with the aim of consistent and transparent assessment. On reward benchmarks and downstream tasks, including Best-of-N selection and reinforcement learning, Skill-RM consistently outperforms traditional judge baselines.
📝 Abstract
Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.
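The idea of selecting and aggregating heterogeneous evidence per input can be illustrated with a small sketch. This is not the Skill-RM implementation (see the linked repository for that). Every name here (`Evidence`, `rule_verifier`, `reference_match`, `rubric_check`, `EVALUATORS`, `evaluate`, `best_of_n`), the context keys, and the weights are hypothetical, and a simple "is this resource present?" check plus a weighted average stands in for the agent's dynamic selection and aggregation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Evidence:
    source: str    # which evaluator produced this score
    score: float   # normalized to [0, 1]
    weight: float  # illustrative weight, not taken from the paper


# Hypothetical evaluator signature: returns None when it does not apply.
Evaluator = Callable[[str, dict], Optional[float]]


def rule_verifier(response: str, context: dict) -> Optional[float]:
    """Rule-based check: the response must contain a required substring."""
    required = context.get("must_contain")
    if required is None:
        return None
    return 1.0 if required in response else 0.0


def reference_match(response: str, context: dict) -> Optional[float]:
    """Ground-truth check: exact match against a reference answer."""
    reference = context.get("reference")
    if reference is None:
        return None
    return 1.0 if response.strip() == reference.strip() else 0.0


def rubric_check(response: str, context: dict) -> Optional[float]:
    """Rubric check: fraction of rubric keywords present in the response."""
    rubric = context.get("rubric")
    if not rubric:
        return None
    hits = sum(1 for item in rubric if item.lower() in response.lower())
    return hits / len(rubric)


# Assumed registry of evaluators and weights (illustrative only).
EVALUATORS: Dict[str, Tuple[Evaluator, float]] = {
    "rule": (rule_verifier, 1.0),
    "reference": (reference_match, 2.0),
    "rubric": (rubric_check, 1.0),
}


def evaluate(response: str, context: dict) -> Tuple[float, List[Evidence]]:
    """Collect evidence from every applicable evaluator and average it."""
    evidence = []
    for name, (fn, weight) in EVALUATORS.items():
        score = fn(response, context)
        if score is not None:  # skip evaluators with no usable resource
            evidence.append(Evidence(name, score, weight))
    if not evidence:
        return 0.0, evidence
    total_weight = sum(e.weight for e in evidence)
    reward = sum(e.score * e.weight for e in evidence) / total_weight
    return reward, evidence


def best_of_n(candidates: List[str], context: dict) -> str:
    """Best-of-N selection: return the candidate with the highest reward."""
    return max(candidates, key=lambda c: evaluate(c, context)[0])
```

Returning the evidence list alongside the scalar reward keeps each score traceable to the sources that produced it, which is one simple way to read the transparency goal described in the abstract. In a real system the selection step would be made by the agent rather than by checking which context keys exist.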
Problem

Research questions and friction points this paper is trying to address.

reward modeling
heterogeneous evaluation criteria
large language models
reinforcement learning
unified framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Skill-RM
reward modeling
heterogeneous evaluation
agentic reasoning
dynamic evidence orchestration