PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

📅 2026-05-28

🤖 AI Summary

This work examines how strongly large language model (LLM) judges depend on the rubrics used to score open-ended responses, noting that vague rubrics can reward fluent answers that invent facts or violate user intent. The authors propose PReMISE, a framework that treats reusable rubrics as auditable measurement specifications and derives a policy-level rubric set from pairwise human-preference data. It audits any rubric set along four axes (structural adequacy, reliability, preference fit, and adversarial robustness) and adds two audit-targeted repair operations. Preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%. Reliability-constrained refinement lowers the rate at which exploit responses receive high scores from 46.4% to 36.0%, with little change in inter-judge agreement (Krippendorff's α moves from .531 to .519). The audit also finds that high inter-rater agreement does not imply low exploitability.

📝 Abstract

LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be "helpful and factual" can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust; and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from 46.4% to 36.0% with little change in inter-judge agreement (α = .531 → .519).
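To make two of the quantities in the abstract concrete, here is a minimal Python sketch of how judge accuracy on paired responses and the high-score rate on exploit responses could be computed, plus one plausible reading of preference-rank selection (keep the rubrics whose scores best agree with human preferences). This is not the authors' implementation. The `Pair` type, the function names, the `threshold` and `k` parameters, and the keyword-overlap `toy_judge` are all hypothetical stand-ins; a real setup would call an LLM judge conditioned on the rubric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (rubric, response) to a numeric score.
Judge = Callable[[str, str], float]


@dataclass(frozen=True)
class Pair:
    preferred: str  # response the human annotator preferred
    rejected: str   # response the human annotator rejected


def pairwise_accuracy(judge: Judge, rubric: str, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the judge scores the preferred response strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(judge(rubric, p.preferred) > judge(rubric, p.rejected) for p in pairs)
    return hits / len(pairs)


def exploit_high_score_rate(
    judge: Judge, rubric: str, exploits: Sequence[str], threshold: float
) -> float:
    """Fraction of exploit responses whose score reaches the (assumed) high-score threshold."""
    if not exploits:
        raise ValueError("exploits must be non-empty")
    high = sum(judge(rubric, r) >= threshold for r in exploits)
    return high / len(exploits)


def select_by_preference_rank(
    judge: Judge, rubrics: Sequence[str], pairs: Sequence[Pair], k: int
) -> list[str]:
    """Keep the k rubrics with the highest pairwise accuracy (ties keep input order)."""
    ranked = sorted(
        rubrics, key=lambda r: pairwise_accuracy(judge, r, pairs), reverse=True
    )
    return ranked[:k]


def toy_judge(rubric: str, response: str) -> float:
    """Stand-in for an LLM judge: counts response words that also appear in the rubric."""
    terms = set(rubric.lower().split())
    return float(sum(1 for word in response.lower().split() if word in terms))


# Illustrative data only; none of it comes from the paper.
RUBRIC_GROUNDED = "cites answers"
RUBRIC_SURFACE = "fluent polished"
PAIRS = [
    Pair(preferred="cites source and answers", rejected="fluent but invented"),
    Pair(preferred="answers question", rejected="polished invented facts"),
]
EXPLOITS = ["fluent polished nonsense", "plain nonsense"]

if __name__ == "__main__":
    for rubric in (RUBRIC_GROUNDED, RUBRIC_SURFACE):
        acc = pairwise_accuracy(toy_judge, rubric, PAIRS)
        rate = exploit_high_score_rate(toy_judge, rubric, EXPLOITS, threshold=2.0)
        print(f"{rubric!r}: preference fit={acc:.2f}, exploit high-score rate={rate:.2f}")
    print(select_by_preference_rank(toy_judge, [RUBRIC_SURFACE, RUBRIC_GROUNDED], PAIRS, k=1))
```

Both metrics take the rubric as an explicit argument while the judge stays fixed, which mirrors the abstract's framing that changing the rubric changes the measurement induced by a fixed judge.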

Problem

Research questions and friction points this paper is trying to address.

LLM judges

rubrics

measurement specifications

preference evaluation

adversarial robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric auditing

LLM judges

preference learning