AI Summary
This study addresses a limitation of how IO500 benchmark results are typically used: leaderboard rankings rest on an aggregate score, which hides much of the actual behavior of high-performance computing storage systems. The work systematically analyzes the fine-grained log data from 61 IO500 submissions across four competition lists (ISC21 through SC22), using Spearman and Pearson correlation analysis, statistical visualization, and log parsing. It finds that IO500 scores span four orders of magnitude and that the bandwidth phases and the metadata phases each correlate strongly within their own domain. It also reveals phenomena that aggregate metrics do not show, including file-system-specific IOR close-time overhead, straggler effects during the stonewall wear-down phase, and load imbalance in the parallel find. The complete submission dataset and the analysis scripts are publicly available, providing a resource for future storage systems research.
Abstract
The IO500 benchmark has become the community standard for evaluating HPC storage system performance, yet the detailed data contained in its submission packages remains largely unexplored beyond aggregate leaderboard rankings. We present a statistical characterization of 61 IO500 submissions from four competition lists (ISC21 through SC22), examining score distributions, inter-phase correlations, and insights derived from detailed log files that accompany each submission. Our analysis reveals that IO500 scores span four orders of magnitude. Spearman correlation analysis shows strong within-domain clustering for both bandwidth (rs = 0.78 to 0.96) and metadata (rs = 0.89 to 0.98) phases, with the composite sub-scores exhibiting rs = 0.92 at per-node level (Pearson r = 0.53). Log-level analysis uncovers file-system-specific patterns in IOR close-time overhead, straggler behavior during the stonewall wear-down phase, and parallel-find load imbalance that are invisible in aggregate scores. These findings demonstrate that IO500 submission packages constitute a valuable research resource for understanding storage system behavior. The full submission dataset is publicly available at https://github.com/IO500/submission-data, and analysis scripts at https://gitlab-ce.gwdg.de/hpc-team/io500-analysis.
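The gap between the reported Spearman and Pearson coefficients (rs = 0.92 versus r = 0.53) is what one would expect when values span several orders of magnitude: Pearson measures linear association and is dominated by the largest systems, while Spearman depends only on rank order. The following is a minimal sketch of that effect, not the authors' analysis scripts. The two lists are invented illustrative numbers, not IO500 data, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values spanning four orders of magnitude (NOT IO500 data).
bandwidth_score = [0.5, 2.0, 8.0, 40.0, 300.0, 5000.0]
metadata_score = [1.0, 3.0, 9.0, 20.0, 60.0, 90.0]

print("Spearman:", spearman(bandwidth_score, metadata_score))
print("Pearson: ", pearson(bandwidth_score, metadata_score))
```

Because the two invented series are monotonically related, their rank correlation is perfect, while the single very large bandwidth value pulls the Pearson coefficient lower. Applying Pearson to log-transformed scores is a common alternative for heavy-tailed data of this kind; whether the authors do so is not stated in the abstract.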