Knowledge Index of Noah's Ark

📅 2026-06-03
🤖 AI Summary
This study addresses three key challenges in evaluating large language models: insufficient subject coverage, inefficient annotation incentives, and unstable rankings. The authors construct a comprehensive benchmark comprising 899 tasks spanning 261 fine-grained disciplines and propose a greedy optimization algorithm grounded in expert anchors, which offers a (1-1/e)-approximation guarantee to enhance disciplinary representativeness. They introduce a performance-based bonus mechanism and theoretically prove its weak first-order stochastic dominance over fixed-wage schemes in annotation quality. Using bootstrap resampling, they quantify ranking stability under limited budgets. Evaluation of 42 models reveals a three-tier performance hierarchy, with the top-performing model, Gemini-3.1-Pro-Preview, achieving 53.17%, and tool augmentation yielding at most a 5.17-point gain. Stability intervals are explicitly reported to prevent misinterpretation of rank orderings.
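The bootstrap ranking-stability idea mentioned in the summary can be illustrated with a small sketch. This is not the authors' procedure: it assumes a hypothetical input layout (one list of 0/1 item outcomes per model), resamples items with replacement, and reports a percentile interval over each model's rank. The function name, parameters, and toy data are all assumptions made for illustration.

```python
import random

def bootstrap_rank_intervals(correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile intervals over bootstrap ranks (rank 1 is best).

    correct: dict mapping model name -> list of 0/1 outcomes, one per item,
             with the same item order for every model (hypothetical layout).
    """
    rng = random.Random(seed)
    models = list(correct)
    n_items = len(correct[models[0]])
    ranks = {m: [] for m in models}
    for _ in range(n_boot):
        # Resample item indices with replacement; all models share the resample.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores = {m: sum(correct[m][i] for i in idx) for m in models}
        # Stable sort: ties keep the input order of the models.
        ordered = sorted(models, key=lambda m: -scores[m])
        for pos, m in enumerate(ordered, start=1):
            ranks[m].append(pos)
    intervals = {}
    for m in models:
        r = sorted(ranks[m])
        lo = r[int((alpha / 2) * n_boot)]
        hi = r[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
        intervals[m] = (lo, hi)
    return intervals

# Toy data (made up): three models scored on 20 items.
toy = {"A": [1] * 20, "C": [1, 0] * 10, "B": [0] * 20}
intervals = bootstrap_rank_intervals(toy, n_boot=200, seed=1)
```

When two models' rank intervals overlap, their relative order is not stable under the given item budget, which is the kind of over-interpretation the reported stability statistics are meant to discourage.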
📝 Abstract
Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
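The (1-1/e) guarantee cited for the coverage-style objective is the classical bound for greedy selection on a monotone submodular function under a cardinality constraint. The sketch below shows the textbook greedy max-coverage routine, not the paper's exact proxy objective: it assumes each candidate item is tagged with the set of expert anchors it covers, and the names `candidates`, `k`, and the toy anchor ids are hypothetical.

```python
def greedy_coverage(candidates, k):
    """Pick up to k items, each time taking the largest marginal coverage gain.

    candidates: dict mapping item id -> set of anchor ids the item covers
                (hypothetical data layout, not taken from the paper).
    Returns (chosen item ids in selection order, set of covered anchors).
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(k):
        # Item with the most not-yet-covered anchors; ties go to the first seen.
        best = max(remaining, key=lambda i: len(remaining[i] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Toy example (made-up anchors): select 2 of 4 candidate items.
items = {"q1": {"a", "b", "c"}, "q2": {"c", "d"}, "q3": {"d", "e"}, "q4": {"a"}}
chosen, covered = greedy_coverage(items, k=2)
print(chosen)  # ['q1', 'q3']
```

As the abstract notes, a bound of this kind applies to the proxy objective being maximized, not to population-level representativeness.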
Problem

Research questions and friction points this paper is trying to address.

knowledge benchmark
disciplinary representativeness
annotation incentive
ranking stability
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

disciplinary representativeness
incentive-compatible annotation
coverage approximation
ranking stability
knowledge benchmarking
Sheng Jin
2077AI
Minghao Liu
2077AI
Yunze Xiao
Language Technologies Institute, Carnegie Mellon University
Natural Language Processing, Computational Social Science, Anthropomorphism
Zeqi Zhou
Brown University
Heli Qi
Waseda University, RIKEN
Multi-Modal Learning
Yifan Yao
Drexel University
Meishu Song
The University of Tokyo
Deep Learning, Multimodal Understanding, Multitask Learning, Health Informatics, Computer Audition
Kaijing Ma
Fudan University
Computer Vision, Machine Learning
Xuan Zhang
2077AI
Sicong Jiang
McGill University, 2077AI
Large Language Models, Vision Language Models, Autonomous Driving, Trustworthy AI
Yizhe Li
2077AI
Ningshan Ma
Massachusetts Institute of Technology
Jie Wei
2077AI
Ziniu Li
The Chinese University of Hong Kong, Shenzhen
Machine Learning, Reinforcement Learning, Large Language Models
Minglai Yang
CS Undergraduate student, University of Arizona
Natural Language Processing, Large Language Models, Machine Learning
Bangya Liu
2077AI
Yiming Liang
Institute of Automation, Chinese Academy of Sciences (CASIA), M-A-P
LLM
Xiao Fang
Professor of Management Information Systems, University of Delaware
FinTech, social network analytics, machine learning, healthcare analytics
Qingcheng Zeng
PhD Student in NLP, Northwestern University
Computational Social Science, NLP, Computational Linguistics
Jiarui Liu
Carnegie Mellon University
Natural Language Processing
Rui Yang
Duke-NUS Medical School
Medical Informatics, Medical Text Mining, Medical Knowledge Graph
Shen Yan
Ph.D. student, Peking University
Artificial Intelligence, Large Language Models
Wenhao Huang
M-A-P
Jiaheng Liu
M-A-P
Zihan Wang
2077AI