Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking

📅 2026-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability and inconsistency of model rankings in multi-metric benchmarking, where the result can depend on the aggregation method or shift when the set of compared models changes. Framing the issue as a social choice problem, the study treats each evaluation metric as inducing a preference ordering over models on each dataset, with a benchmark operator aggregating these orderings as votes. It identifies three structural conditions on the rankings (single-peakedness, group separability, and distance restriction) under which the constraints of Arrow's impossibility theorem no longer bind, and proves that well-behaved multi-criteria rankings can be constructed in those cases. An empirical study of modern benchmark suites such as HELM MMLU checks which of these conditions hold on which benchmark problems.
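To make the two failure modes concrete, here is a minimal Python sketch. It is not taken from the paper: the model names (A, B, C), the toy rankings, and the helper functions `pairwise_majority` and `borda_scores` are all illustrative assumptions. The first profile produces a majority cycle (incoherence); the second shows a Borda ranking of A and B flipping once model C is dropped (instability under changes to the model set).

```python
def pairwise_majority(rankings, a, b):
    """True if strictly more rankings place a above b than b above a."""
    wins = sum(r.index(a) < r.index(b) for r in rankings)
    return wins > len(rankings) - wins


def borda_scores(rankings, models):
    """Borda count restricted to `models`: one point per model ranked below."""
    scores = {m: 0 for m in models}
    for r in rankings:
        restricted = [m for m in r if m in models]
        for pos, m in enumerate(restricted):
            scores[m] += len(restricted) - 1 - pos
    return scores


# Hypothetical profile: three metrics, each ranking three models (best first).
cyclic_profile = [
    ["A", "B", "C"],  # e.g. accuracy
    ["B", "C", "A"],  # e.g. robustness
    ["C", "A", "B"],  # e.g. efficiency
]
# A beats B, B beats C, and C beats A: pairwise majority yields no coherent ranking.
cycle = (
    pairwise_majority(cyclic_profile, "A", "B")
    and pairwise_majority(cyclic_profile, "B", "C")
    and pairwise_majority(cyclic_profile, "C", "A")
)

# Hypothetical profile: five metric-induced rankings over the same models.
votes = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2
full = borda_scores(votes, {"A", "B", "C"})  # B scores above A
reduced = borda_scores(votes, {"A", "B"})    # with C removed, A scores above B

print("majority cycle:", cycle)
print("Borda with C:", full)
print("Borda without C:", reduced)
```

In the second profile no metric changes its opinion about A versus B, yet their aggregate order depends on whether C is part of the benchmark.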

📝 Abstract

Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness and efficiency. When trying to turn these metrics into a single ranking, natural aggregation procedures can become incoherent or unstable under changes to the model set. We formalize this aggregation as a social choice problem where each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples, and we identify sufficient conditions under which these pathologies disappear and meaningful multi-criteria benchmarking becomes possible. In particular, we consider three restrictions on the combinations of rankings and prove that on single-peaked, group-separable and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites like HELM MMLU and verify which structural conditions are fulfilled on which benchmark problems.
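As an illustration of the first restriction, the sketch below checks single-peakedness against a given axis and then aggregates by pairwise majority. It relies on a classical fact (Black's median voter theorem): with an odd number of single-peaked strict rankings, the majority relation is transitive. The axis (models ordered by size), the model names, and the functions `is_single_peaked` and `majority_order` are assumptions for illustration; the paper's own benchmark operator and proofs may differ.

```python
from itertools import permutations


def is_single_peaked(ranking, axis):
    """Check a strict ranking (best first) against a left-to-right axis.

    The ranking is single-peaked if each next-best model sits directly
    next to the interval of the axis covered by the better-ranked models.
    """
    pos = {m: i for i, m in enumerate(axis)}
    lo = hi = pos[ranking[0]]
    for m in ranking[1:]:
        p = pos[m]
        if p == lo - 1:
            lo = p
        elif p == hi + 1:
            hi = p
        else:
            return False
    return True


def majority_order(rankings):
    """Order models by number of pairwise majority wins (Copeland score)."""
    models = rankings[0]

    def beats(a, b):
        return 2 * sum(r.index(a) < r.index(b) for r in rankings) > len(rankings)

    wins = {a: sum(beats(a, b) for b in models if b != a) for a in models}
    return sorted(models, key=lambda m: -wins[m])


# Hypothetical axis: models ordered by size.
axis = ["small", "medium", "large"]
profile = [
    ["large", "medium", "small"],  # e.g. accuracy
    ["small", "medium", "large"],  # e.g. efficiency
    ["medium", "large", "small"],  # e.g. robustness
]
all_peaked = all(is_single_peaked(r, axis) for r in profile)
order = majority_order(profile)

# The classic cyclic profile is not single-peaked on any axis.
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
cyclic_peaked_somewhere = any(
    all(is_single_peaked(r, list(ax)) for r in cyclic)
    for ax in permutations("ABC")
)

print(all_peaked, order, cyclic_peaked_somewhere)
```

The brute-force search over axes is only practical for a handful of models; it is used here solely to show that the cyclic profile falls outside the restricted domain.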

Problem

Research questions and friction points this paper is trying to address.

multi-criteria benchmarking

preference aggregation

social choice

model ranking

Arrow's impossibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

social choice theory

multi-criteria benchmarking

Arrow's impossibility theorem

single-peaked preferences

benchmark aggregation
