AI Summary
This work addresses the lack of a unified framework in existing LLM evaluation methods based on rubrics, which suffer from terminological inconsistency and fragmented implementations. We propose Autorubric, an open-source framework that systematically integrates analytic rubrics, single- and multi-rater aggregation, few-shot calibration, bias mitigation, and psychometric reliability metrics such as Cohen's κ. Autorubric supports binary, ordinal, and nominal criteria and provides actionable default configurations. Experimental results demonstrate that Autorubric achieves 80% and 87% accuracy on RiceChem and CHARM-100, respectively; successfully improves peer-review agent scores from 0.47 to 0.85; and significantly enhances AdvancedIF performance via RL-based rewards (+0.039, p=0.032).
Abstract
Techniques for reliable rubric-based LLM evaluation -- ensemble judging, bias mitigation, few-shot calibration -- are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these rubric-based LLM evaluation lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80\% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground truth labels (87\% binary accuracy, moderate-to-substantial $\kappa$). Beyond measurement, per-criterion scores and explanations serve as optimization signals. We demonstrate how Autorubric's rubric-evaluation explanations raise a peer review agent's score from 0.47 to 0.85 (above the 0.82 expert-curated baseline), and its scores serve as RL rewards to produce statistically significant improvement on AdvancedIF (+0.039, Wilcoxon $p = 0.032$) with positive transfer to IFEval. In all of these cases, Autorubric enabled us to rapidly operationalize various rubric design choices and best practices with minimal effort.
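To make two of the ingredients named above concrete, the following is a minimal sketch of ensemble judging by majority vote on a single binary criterion, followed by Cohen's $\kappa$ between the ensemble verdicts and ground-truth labels. It is not Autorubric's API: the function names `majority_vote` and `cohen_kappa` and all of the toy verdicts are hypothetical, chosen only for illustration, and the aggregation rule used by the framework itself may differ.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict among several judges for one item.

    Assumption: an odd number of judges on a binary criterion, so ties
    cannot occur. With ties, Counter returns the first-seen label.
    """
    return Counter(verdicts).most_common(1)[0][0]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters always use the same single label; kappa is
        # undefined here, and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = criterion met) from three judges on six items.
judge_1 = [1, 1, 0, 1, 0, 1]
judge_2 = [1, 0, 0, 1, 0, 1]
judge_3 = [1, 1, 0, 0, 0, 1]
ground_truth = [1, 1, 0, 1, 1, 1]

ensemble = [majority_vote(v) for v in zip(judge_1, judge_2, judge_3)]
kappa = cohen_kappa(ensemble, ground_truth)
print(ensemble, round(kappa, 3))
```

In this toy data the ensemble agrees with the ground truth on five of six items (raw accuracy of about 0.83), yet $\kappa = 4/7 \approx 0.57$, because both label sets are skewed toward the positive class and part of the agreement is expected by chance. This gap is why reliability metrics are reported alongside accuracy.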