RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation rubrics for large language models lack systematic approaches to diagnosing quality issues. This work proposes RIFT, the first taxonomy of rubric failure modes, developed via grounded theory and encompassing eight distinct failure types across three dimensions: reliability, content validity, and consequential validity. Accompanying the taxonomy is a suite of automated diagnostic metrics designed to detect such failures. Empirical evaluation demonstrates substantial inter-annotator agreement (87% pairwise consistency, mean Cohen's κ = 0.64) and strong alignment between the automated metrics and human judgments, with a peak F1 score of 0.86. This study is the first to systematically model, and scalably diagnose, quality problems in LLM evaluation rubrics.
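
A minimal sketch of how the two agreement statistics above are conventionally computed, assuming each annotator assigns one categorical failure-mode label per rubric item; the annotator names and labels below are invented for illustration and are not the paper's data:

```python
# Minimal sketch (hypothetical data): pairwise percent agreement and
# mean pairwise Cohen's kappa across annotators, the two statistics
# the summary reports (87% and 0.64). Requires numpy and scikit-learn.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# One categorical failure-mode label per rubric item, per annotator.
# Category names here are placeholders, not RIFT's actual eight modes.
annotations = {
    "annotator_a": ["reliability", "content", "none", "consequential"],
    "annotator_b": ["reliability", "content", "none", "content"],
    "annotator_c": ["reliability", "none", "none", "consequential"],
}

kappas, agreements = [], []
for labels_1, labels_2 in combinations(annotations.values(), 2):
    kappas.append(cohen_kappa_score(labels_1, labels_2))
    agreements.append(np.mean(np.array(labels_1) == np.array(labels_2)))

print(f"mean pairwise agreement: {np.mean(agreements):.0%}")
print(f"mean pairwise Cohen's kappa: {np.mean(kappas):.2f}")
```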
📝 Abstract
Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by iteratively annotating rubrics drawn from five diverse benchmarks spanning general instruction following, code generation, creative writing, and expert-level deep research, until no new failure modes are identified. We evaluate the consistency of the taxonomy by measuring agreement among independent human annotators, observing substantial agreement overall (87% pairwise agreement and an average Cohen's kappa of 0.64). Finally, to support scalable diagnosis, we propose automated rubric quality metrics and show that they align with human failure-mode annotations, achieving up to 0.86 F1.
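
The alignment result is a standard binary F1 between automated flags and human annotations. A minimal sketch of that scoring, assuming one binary human label and one automated flag per rubric criterion (all data below is invented for illustration):

```python
# Minimal sketch (hypothetical data): binary F1 between an automated
# diagnostic's flags and human failure-mode annotations, the alignment
# measure behind the "up to 0.86 F1" result. Requires scikit-learn.
from sklearn.metrics import f1_score

# 1 = the rubric criterion exhibits a given failure mode, 0 = it does not.
human_flags     = [1, 0, 1, 1, 0, 0, 1, 0]  # human annotators' verdicts
automatic_flags = [1, 0, 1, 0, 0, 1, 1, 0]  # automated metric's verdicts

print(f"F1 vs. human annotations: {f1_score(human_flags, automatic_flags):.2f}")
```

Repeating this computation once per failure mode and reporting the best-aligned case would yield an "up to X F1" figure of the kind the abstract reports.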
Problem

Research questions and friction points this paper addresses.

rubric evaluation
failure mode
diagnostic methodology
LLM benchmarking
evaluation validity
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric evaluation
failure mode taxonomy
automated diagnostics
LLM benchmarking
validity analysis