Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

📅 2026-05-29

🤖 AI Summary
This study addresses an under-examined limitation of AI benchmarking: uniformly averaging item-level scores treats every test item as equally valuable, ignoring how much each item contributes to welfare objectives. Modeling benchmarking as a multitask principal-agent game, the authors show that a benchmark's welfare loss is determined jointly by three item-level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. They turn this theory into an audit framework that ranks items along each axis, using WORKBank to operationalize welfare, the EvoLM 4B suite to assess improvability, and the PolyPythias 410M panel to estimate variance. Applied to OLMES, the framework surfaces items that are Pareto-inferior under a pro-worker welfare operationalization.
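To make the aggregation point concrete, the sketch below contrasts a uniform average of item-level scores with a weighted one. It is only an illustration: the function names, the scores, and the weights are hypothetical, and the paper's actual aggregation rule is not reproduced here.

```python
from typing import Sequence


def uniform_score(item_scores: Sequence[float]) -> float:
    """Uniform average: every item counts equally."""
    if not item_scores:
        raise ValueError("item_scores must be non-empty")
    return sum(item_scores) / len(item_scores)


def weighted_score(item_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average: items count in proportion to their (hypothetical) weights."""
    if len(item_scores) != len(weights):
        raise ValueError("item_scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(item_scores, weights)) / total


# Hypothetical per-item correctness and illustrative weights (not from the paper).
scores = [1.0, 0.0, 1.0, 0.0]
weights = [3.0, 1.0, 3.0, 1.0]

uniform = uniform_score(scores)
weighted = weighted_score(scores, weights)
```

With these made-up numbers the two summaries disagree, which is the sense in which a uniform average carries an implicit judgment that all items matter equally.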
📝 Abstract
AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal-agent game and show that the welfare loss from a benchmark is determined jointly by three item-level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. We translate the theory into an audit framework that ranks items along each of these three axes, and apply it to OLMES items using WORKBank for welfare, the EvoLM 4B suite for improvability, and the PolyPythias 410M panel for variance. The framework surfaces items that are Pareto-inferior within OLMES subject to a pro-worker welfare operationalization. All code is available at https://github.com/stair-lab/principal-agent-benchmarks.
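One way to read "Pareto-inferior" is as a dominance check over the three item-level axes. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): it assumes higher alignment and higher improvability are preferable and lower variance is preferable, and the class name, field names, and scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ItemScores:
    item_id: str
    alignment: float      # assumed: higher is better
    improvability: float  # assumed: higher is better
    variance: float       # assumed: lower is better


def dominates(a: ItemScores, b: ItemScores) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    no_worse = (
        a.alignment >= b.alignment
        and a.improvability >= b.improvability
        and a.variance <= b.variance
    )
    strictly_better = (
        a.alignment > b.alignment
        or a.improvability > b.improvability
        or a.variance < b.variance
    )
    return no_worse and strictly_better


def pareto_inferior(items: Sequence[ItemScores]) -> List[str]:
    """Return ids of items dominated by at least one other item (O(n^2) scan)."""
    return [
        b.item_id
        for b in items
        if any(dominates(a, b) for a in items if a is not b)
    ]


# Hypothetical scores for three items (illustrative values only).
items = [
    ItemScores("item_a", alignment=0.9, improvability=0.8, variance=0.10),
    ItemScores("item_b", alignment=0.5, improvability=0.4, variance=0.30),
    ItemScores("item_c", alignment=0.2, improvability=0.9, variance=0.05),
]
flagged = pareto_inferior(items)
```

Here `item_b` is flagged because `item_a` is better on all three axes, while `item_a` and `item_c` each win on at least one axis and so neither dominates the other. Flipping the assumed direction of any axis would change which items are flagged.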
Problem

Research questions and friction points this paper is trying to address.

AI benchmarks
welfare loss
item aggregation
principal-agent
performance variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

principal-agent framework
benchmark aggregation
welfare alignment
marginal improvability
performance variance