π€ AI Summary
This study addresses three key challenges in evaluating large language models: insufficient subject coverage, inefficient annotation incentives, and unstable rankings. The authors construct a comprehensive benchmark comprising 899 tasks spanning 261 fine-grained disciplines and propose a greedy optimization algorithm grounded in expert anchors, which offers a (1β1/e)-approximation guarantee to enhance disciplinary representativeness. They introduce a performance-based bonus mechanism and theoretically prove its weak first-order stochastic dominance over fixed-wage schemes in annotation quality. Using bootstrap resampling, they quantify ranking stability under limited budgets. Evaluation of 42 models reveals a three-tier performance hierarchy, with the top-performing model, Gemini-3.1-Pro-Preview, achieving 53.17%, and tool augmentation yielding at most a 5.17-point gain. Stability intervals are explicitly reported to prevent misinterpretation of rank orderings.
π Abstract
Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.