Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of standard AI benchmarking: uniformly averaging item-level scores treats every test item as equally valuable and ignores how much each item contributes to welfare objectives. Modeling benchmarking as a multitask principal-agent game, the authors show that a benchmark's welfare loss is determined jointly by three item-level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. They turn the theory into an audit framework that ranks items along each of these axes, using WORKBank for welfare, the EvoLM 4B suite for improvability, and the PolyPythias 410M panel for variance. Applied to OLMES, the framework surfaces items that are Pareto-inferior under a pro-worker welfare operationalization.
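To make the aggregation point concrete, the following minimal sketch contrasts uniform averaging with a non-uniform weighted average of item scores. The function names, scores, and weights are hypothetical and chosen only for illustration; they are not the weights derived in the paper.

```python
from typing import Sequence


def uniform_score(scores: Sequence[float]) -> float:
    """Standard benchmark summary: every item gets weight 1/n."""
    return sum(scores) / len(scores)


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of item scores; weights are normalized to sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


# Hypothetical per-item accuracies for a four-item benchmark.
scores = [1.0, 0.0, 1.0, 0.0]

# Hypothetical weights: the first item is assumed to matter three times as much.
weights = [3.0, 1.0, 1.0, 1.0]

uniform = uniform_score(scores)
weighted = weighted_score(scores, weights)
```

With equal weights, `weighted_score` reduces to `uniform_score`, which is the sense in which uniform averaging is one particular (implicit) choice of item weights.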

📝 Abstract

AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal-agent game and show that the welfare loss from a benchmark is determined jointly by three item-level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. We translate the theory into an audit framework that ranks items along each of these three axes, and apply it to OLMES items using WORKBank for welfare, the EvoLM 4B suite for improvability, and the PolyPythias 410M panel for variance. The framework surfaces items that are Pareto-inferior within OLMES subject to a pro-worker welfare operationalization. All code is available at https://github.com/stair-lab/principal-agent-benchmarks.
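As an illustration of what "Pareto-inferior" means across the three axes, here is a minimal sketch of a dominance check. It assumes that higher welfare alignment and higher improvability are preferable and that lower variance is preferable; those directions, the `Item` type, and the numbers are assumptions made for this example and are not taken from the paper or its repository.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    name: str
    alignment: float      # assumed: higher is better
    improvability: float  # assumed: higher is better
    variance: float       # assumed: lower is better


def dominates(a: Item, b: Item) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least_as_good = (
        a.alignment >= b.alignment
        and a.improvability >= b.improvability
        and a.variance <= b.variance
    )
    strictly_better = (
        a.alignment > b.alignment
        or a.improvability > b.improvability
        or a.variance < b.variance
    )
    return at_least_as_good and strictly_better


def pareto_inferior(items: List[Item]) -> List[str]:
    """Names of items dominated by at least one other item."""
    return [b.name for b in items if any(dominates(a, b) for a in items)]


# Hypothetical items with made-up axis values.
items = [
    Item("item_a", alignment=0.9, improvability=0.5, variance=0.10),
    Item("item_b", alignment=0.4, improvability=0.3, variance=0.20),
    Item("item_c", alignment=0.2, improvability=0.8, variance=0.05),
]

inferior = pareto_inferior(items)
```

In this toy data, `item_b` is worse than `item_a` on all three axes, while `item_a` and `item_c` each win on at least one axis and so neither dominates the other. The pairwise check is quadratic in the number of items, which is adequate for a sketch but may need a faster method for large item pools.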

Problem

Research questions and friction points this paper is trying to address.

AI benchmarks

welfare loss

item aggregation

principal-agent

performance variance

Innovation

Methods, ideas, or system contributions that make the work stand out.

principal-agent framework

benchmark aggregation

welfare alignment