🤖 AI Summary
Existing CPU benchmarks (e.g., SPEC CPU2017) lack explicit system configuration specifications, leading to performance interference from non-CPU components and severely undermining comparability, consistency, and reproducibility.
Method: We propose a novel CPU performance evaluation paradigm grounded in the principle of “fully specified and valid configurations,” establishing a systematic modeling framework that spans the complete configuration space; we design an unbiased sampling strategy that uniformly weights all compliant configurations; and we replace point estimates with confidence intervals and associated confidence levels for performance reporting.
Results: Experiments reveal up to 74.49× performance variation for the same CPU across compliant configurations. Our framework eliminates configuration ambiguity entirely, enabling fair cross-CPU comparisons and significantly improving consistency, reproducibility, and statistical rigor of benchmarking outcomes.
📝 Abstract
The SPEC CPU2017 benchmark suite is an industry standard for assessing CPU performance. It strictly mandates some workload and system configurations (arbitrary specificity) while leaving other system configurations undefined (arbitrary ambiguity). This article reveals that: (1) arbitrary specificity is not meaningful and obscures many scenarios, as evidenced by significant performance variations, including a 74.49x performance difference observed on the same CPU; (2) arbitrary ambiguity is unfair because it fails to establish identical configurations when comparing different CPUs. We propose an innovative CPU evaluation methodology. It considers all workload and system configurations valid and mandates that each configuration be fully defined, thereby avoiding both arbitrary specificity and arbitrary ambiguity. To reduce evaluation cost, a sampling approach selects a subset of the configurations. To expose CPU performance under different scenarios, the methodology treats the outcomes under all configurations as equally important. Finally, it reports outcomes using a confidence level and confidence interval to avoid bias.
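The evaluation pipeline described above can be sketched in a few lines: enumerate every fully specified configuration, sample uniformly so that each compliant configuration carries equal weight, and report a confidence interval instead of a single score. The configuration dimensions, the benchmark stand-in, and all names below are illustrative assumptions, not the paper's actual configuration space or measurement harness.

```python
# Illustrative sketch of the proposed paradigm: uniform sampling over a
# hypothetical configuration space plus confidence-interval reporting.
import itertools
import math
import random
import statistics

# Hypothetical configuration dimensions (assumed for illustration only).
CONFIG_SPACE = {
    "compiler_opt": ["-O2", "-O3", "-Ofast"],
    "smt": ["on", "off"],
    "memory_channels": [1, 2, 4],
    "huge_pages": ["on", "off"],
}

def all_valid_configs(space):
    """Enumerate every fully specified configuration (Cartesian product)."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def uniform_sample(space, n, seed=0):
    """Draw n configurations, weighting every valid configuration equally."""
    configs = list(all_valid_configs(space))
    rng = random.Random(seed)
    return [rng.choice(configs) for _ in range(n)]

def report_ci(scores, z=1.96):
    """Mean score with a ~95% normal-approximation confidence interval."""
    mean = statistics.fmean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half, mean, mean + half

def run_benchmark(config, rng):
    """Stand-in for running the benchmark under one configuration."""
    base = 10.0 if config["compiler_opt"] == "-Ofast" else 5.0
    return base * (2 if config["smt"] == "on" else 1) + rng.random()

rng = random.Random(1)
scores = [run_benchmark(c, rng) for c in uniform_sample(CONFIG_SPACE, 30)]
low, mean, high = report_ci(scores)
print(f"score: {mean:.2f}  (95% CI: [{low:.2f}, {high:.2f}])")
```

The key design point is that the result is an interval with an attached confidence level, so two CPUs are compared over the same well-defined configuration population rather than over whichever single configuration each vendor happened to report.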