🤖 AI Summary
This work addresses how to choose a calibration method for large language model (LLM) judge panels when human annotation budgets are limited, balancing estimation cost against the capacity to model interactions between judges. The authors frame the trade-off as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family, with table and parametric estimation diagnostics. The approach compares low-dimensional stackers against joint output table calibration on four benchmarks (RewardBench, LLMBar, SummEval, and Arena100K). Scalar/reliability aggregation wins 16 of 20 dataset–budget configurations, suggesting that judge outputs are predominantly additive or redundant on these datasets. In controlled calibration-growth data with a six-way interaction, however, a larger joint table is selected and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes, identifying the regime in which high-order joint modeling is needed.
📝 Abstract
We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset–budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not “how many judges?” but whether the next judge's information is estimable under the available human labels.
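The stacker-versus-table tradeoff described above can be illustrated with a toy simulation. The sketch below is not the authors' implementation: every function name, the synthetic data generator, the parity label used as a stand-in for a six-way interaction, and the noise level are assumptions made for illustration. It fits a one-feature linear stacker on the mean vote and a per-pattern joint table, then reports test MSE for both along with the fraction of test items whose vote pattern never appeared in calibration (the "unseen mass").

```python
import random
from collections import defaultdict

def simulate(n, k, interaction, rng):
    """Draw n items: k binary judge votes and a noisy human label (synthetic)."""
    rows = []
    for _ in range(n):
        votes = tuple(rng.randint(0, 1) for _ in range(k))
        # Parity is a pure k-way interaction; the mean vote is purely additive.
        signal = sum(votes) % 2 if interaction else sum(votes) / k
        rows.append((votes, signal + rng.gauss(0, 0.1)))
    return rows

def fit_scalar(train):
    """Low-dimensional stacker: least-squares line on the mean vote."""
    xs = [sum(v) / len(v) for v, _ in train]
    ys = [y for _, y in train]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else 0.0
    return lambda v: my + slope * (sum(v) / len(v) - mx)

def fit_table(train):
    """Joint table: mean label per vote pattern, global mean for unseen patterns."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in train:
        sums[v] += y
        counts[v] += 1
    global_mean = sum(y for _, y in train) / len(train)
    table = {v: sums[v] / counts[v] for v in counts}
    return (lambda v: table.get(v, global_mean)), set(table)

def mse(predict, data):
    return sum((predict(v) - y) ** 2 for v, y in data) / len(data)

def compare(n_train, interaction, k=6, seed=0):
    """Fit both calibrators on n_train labels and score them on fresh test data."""
    rng = random.Random(seed)
    train = simulate(n_train, k, interaction, rng)
    test = simulate(2000, k, interaction, rng)
    scalar = fit_scalar(train)
    table, seen = fit_table(train)
    unseen = sum(1 for v, _ in test if v not in seen) / len(test)
    return {"scalar_mse": mse(scalar, test),
            "table_mse": mse(table, test),
            "unseen_mass": unseen}

if __name__ == "__main__":
    print("additive, 30 labels:     ", compare(30, interaction=False))
    print("interaction, 2000 labels:", compare(2000, interaction=True))
```

In this toy setup, six binary judges give 64 possible vote patterns, so a budget of 30 labels leaves most patterns unseen and the table falls back to the global mean, while the stacker needs only a slope and an intercept. With a parity label and enough calibration data to cover every pattern, the mean vote carries no signal and only the table can fit the labels. The numbers it prints are specific to this synthetic generator and are not comparable to the MSE values reported in the abstract.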