🤖 AI Summary
This work addresses the instability and incoherence of model rankings in multi-metric benchmarking, where natural aggregation procedures can produce inconsistent results or shift when the set of models changes. Framing the issue as a social choice problem, the study treats each evaluation metric as inducing a preference ranking over models on each dataset, with a benchmark operator aggregating these rankings as votes across metrics. By identifying sufficient structural conditions (single-peaked, group-separable, and distance-restricted preferences) under which the pathologies behind Arrow's impossibility theorem disappear, the paper proves that well-behaved multi-criteria rankings can be constructed. An empirical study of modern benchmark suites such as HELM MMLU checks which of these structural conditions hold on which benchmark problems.
📝 Abstract
Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness and efficiency. When these metrics are turned into a single ranking, natural aggregation procedures can become incoherent or unstable under changes to the model set. We formalize this aggregation as a social choice problem in which each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples, and we identify sufficient conditions under which these disappear and meaningful multi-criteria benchmarking becomes possible. In particular, we consider three restrictions on the combinations of rankings and prove that, on single-peaked, group-separable and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites like HELM MMLU and verify which structural conditions are fulfilled on which benchmark problems.
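To make the social choice framing concrete, the following is a minimal Python sketch under stated assumptions, not the paper's benchmark operator. It uses pairwise majority voting as a stand-in aggregation rule. The model names `A`, `B`, `C`, the fixed left-to-right `axis`, and all function names are hypothetical. The sketch shows how three metric rankings can form a majority cycle with no coherent winner, and how a single-peaked profile avoids this.

```python
from itertools import combinations


def pairwise_majority(rankings):
    """Return the set of pairs (a, b) where a strict majority ranks a above b."""
    models = rankings[0]
    wins = set()
    for a, b in combinations(models, 2):
        a_over_b = sum(r.index(a) < r.index(b) for r in rankings)
        if 2 * a_over_b > len(rankings):
            wins.add((a, b))
        elif 2 * a_over_b < len(rankings):
            wins.add((b, a))
    return wins


def condorcet_winner(rankings):
    """Return the model that beats every other model pairwise, or None."""
    models = rankings[0]
    wins = pairwise_majority(rankings)
    for m in models:
        if all((m, other) in wins for other in models if other != m):
            return m
    return None


def is_single_peaked(ranking, axis):
    """Check single-peakedness of one strict ranking on a given axis.

    A strict ranking is single-peaked on `axis` exactly when each of its
    top-k sets occupies a contiguous interval of the axis.
    """
    positions = [axis.index(m) for m in ranking]
    lo = hi = positions[0]
    for p in positions[1:]:
        if p == lo - 1:
            lo = p
        elif p == hi + 1:
            hi = p
        else:
            return False
    return True


# Hypothetical rankings induced by three metrics (best model first).
axis = ["A", "B", "C"]
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
peaked = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]

print(condorcet_winner(cyclic))  # no model beats all others pairwise
print(condorcet_winner(peaked))
print(all(is_single_peaked(r, axis) for r in peaked))
```

In the cyclic profile, majorities prefer A to B, B to C, and C to A, so no ranking is consistent with all pairwise majorities. The second profile is single-peaked on the assumed axis, and for an odd number of such rankings the majority relation is known to be transitive (Black's median voter result), which is the kind of structural restriction the abstract refers to.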