Knowledge Index of Noah's Ark

📅 2026-06-03

🤖 AI Summary

This study addresses three challenges in knowledge benchmarks for large language models: designs that scale up without operationalizing disciplinary representativeness, flat-payment annotation that permits lazy consensus, and unaudited ranking instability under bounded test budgets. The authors construct KINA, a benchmark of 899 items spanning 261 fine-grained disciplines, and cast representativeness as a coverage-style objective over expert-elicited anchors. Greedy selection carries a (1−1/e) approximation guarantee, which applies to the proxy objective rather than to population representativeness. They prove that a bonus-on-bar tournament weakly first-order stochastically dominates flat payment in released-review quality. Bootstrap resampling quantifies ranking stability under bounded test budgets. Evaluation of 42 models from 13 labs reveals a three-tier performance hierarchy: the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, and tool augmentation adds at most 5.17 points. Ranking-stability statistics are reported to discourage over-interpretation of adjacent ranks.
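The bootstrap idea mentioned in the summary can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes per-item 0/1 correctness for every model on the same items, resamples items with replacement, and reports a percentile interval on each model's rank. The function name `bootstrap_rank_intervals` and its parameters are hypothetical.

```python
import random

def bootstrap_rank_intervals(correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile intervals on model ranks under item resampling.

    correct: dict mapping model name -> list of 0/1 scores, one per item,
             with items in the same order for every model (an assumption).
    Returns: dict mapping model name -> (low_rank, high_rank), rank 1 = best.
    """
    rng = random.Random(seed)
    models = list(correct)
    n_items = len(next(iter(correct.values())))
    ranks = {m: [] for m in models}

    for _ in range(n_boot):
        # Resample item indices with replacement (one bootstrap replicate).
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        acc = {m: sum(correct[m][i] for i in idx) / n_items for m in models}
        # Stable sort: ties keep the input order of the models.
        ordered = sorted(models, key=lambda m: -acc[m])
        for rank, m in enumerate(ordered, start=1):
            ranks[m].append(rank)

    lo_i = int((alpha / 2) * n_boot)
    hi_i = int((1 - alpha / 2) * n_boot) - 1
    intervals = {}
    for m in models:
        s = sorted(ranks[m])
        intervals[m] = (s[lo_i], s[hi_i])
    return intervals
```

When two models' rank intervals overlap, their relative order is not supported by the test budget, which is the kind of over-interpretation of adjacent ranks the summary warns against.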

📝 Abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1−1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > ΔC / Δp_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38–45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
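To make the coverage-style objective concrete, here is a minimal sketch of greedy selection for weighted maximum coverage under a cardinality budget, the standard setting in which greedy attains a (1−1/e) approximation. The abstract does not spell out the exact objective, so the data layout (each candidate item covers a set of anchors, each anchor has a weight), the budget `k`, and the name `greedy_cover` are all assumptions made for illustration.

```python
def greedy_cover(candidates, anchor_weight, k):
    """Greedily pick up to k items maximizing total weight of covered anchors.

    candidates:    dict mapping item id -> set of anchor ids it covers
                   (hypothetical layout, not taken from the paper)
    anchor_weight: dict mapping anchor id -> non-negative weight
    k:             maximum number of items to select
    Returns: (list of chosen item ids in pick order, set of covered anchors)
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)

    for _ in range(min(k, len(candidates))):
        best, best_gain = None, 0.0
        for item, anchors in remaining.items():
            # Marginal gain: weight of anchors this item would newly cover.
            gain = sum(anchor_weight[a] for a in anchors - covered)
            if gain > best_gain:
                best, best_gain = item, gain
        if best is None:
            break  # no remaining item adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)

    return chosen, covered


if __name__ == "__main__":
    items = {
        "q1": {"algebra", "topology"},
        "q2": {"topology", "genetics"},
        "q3": {"genetics", "ecology", "optics"},
    }
    weights = {a: 1.0 for anchors in items.values() for a in anchors}
    print(greedy_cover(items, weights, k=2))
```

As the abstract notes, a guarantee of this kind bounds how well the selection covers the anchors (the proxy), not how well the anchors themselves represent the underlying population of disciplines.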

Problem

Research questions and friction points this paper is trying to address.

knowledge benchmark

disciplinary representativeness

annotation incentive

ranking stability

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

disciplinary representativeness

incentive-compatible annotation

coverage approximation