🤖 AI Summary
This work addresses the dependence of LLM judges on the rubrics used to score open-ended responses, noting that vague rubrics can reward polished answers that invent facts or violate user intent. The authors propose PReMISE, a framework that treats reusable rubrics as auditable measurement specifications and derives a policy-level rubric set from pairwise human-preference data. It audits any rubric set along four axes (structural adequacy, reliability, preference fit, and adversarial robustness) and contributes two audit-targeted repair operations. Preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, and reliability-constrained refinement lowers the rate at which exploit responses receive high scores from 46.4% to 36.0% with little change in inter-judge agreement (Krippendorff’s α from 0.531 to 0.519). The audit also shows that no raw rubric source is simultaneously reliable, preference-predictive, and adversarially robust, and that high inter-rater agreement does not imply low exploitability.
📝 Abstract
LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be ``helpful and factual'' can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response-quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust, and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from $65.0\%$ to $68.6\%$, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from $46.4\%$ to $36.0\%$ with little change in inter-judge agreement ($\alpha{=}.531\to.519$).
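The two headline quantities in the abstract, judge accuracy on paired responses and the rate at which exploit responses receive high scores, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the convention that ties count as errors, and the `threshold` that defines a "high" score are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs in which the judge scores the human-preferred
    response strictly higher than the rejected one.

    Each pair is (score_of_preferred, score_of_rejected).
    Assumption: a tie counts as an error.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


def exploit_high_score_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of exploit (adversarial) responses whose judge score is
    at or above `threshold`.

    Assumption: `threshold` is a hypothetical cutoff for a "high" score.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s >= threshold) / len(scores)


if __name__ == "__main__":
    # Toy judge scores on a hypothetical 1-5 scale, not data from the paper.
    toy_pairs = [(4, 2), (3, 3), (1, 5), (5, 1)]
    toy_exploit_scores = [5, 4, 2, 1, 4]
    print(pairwise_accuracy(toy_pairs))
    print(exploit_high_score_rate(toy_exploit_scores, threshold=4))
```

Under these assumptions, a repair operation that improves preference fit would raise the first quantity, while one that improves adversarial robustness would lower the second.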