🤖 AI Summary
Evaluating large language models (LLMs) on comprehensive benchmarks is computationally expensive. This work proposes the first systematic application of Maximum Independent Set (MIS) algorithms to prompt selection, constructing a similarity graph based on embedding-space distances to identify diverse and non-redundant prompt subsets. The authors evaluate four MIS solvers, six embedding models, three distance metrics, and six percentile thresholds on four benchmarks covering 66 LLMs. In 99.2% of stochastic configurations, LLM rankings obtained from repeated selection under different random seeds reach a Kendall’s coefficient of concordance W ≥ 0.90 (mean: 0.997), and at higher percentile thresholds the selected subsets reduce prompt counts by 25–48% on average. Ranking divergence from the full benchmark (ρ < 0.95) occurs in only 15.95% of configurations, predominantly at low thresholds, which points to overly dense graphs as the main failure regime.
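To make the agreement statistic concrete, the following is a minimal sketch of Kendall's coefficient of concordance for complete rankings without ties. The function name `kendalls_w` and the input layout (one ranking per random seed, one rank per LLM) are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's W for m complete rankings of n items, assuming no ties.

    rankings[j][i] is the rank (1..n) that run j assigns to item i.
    Returns 1.0 for perfect agreement and 0.0 when rank sums are all equal.
    """
    m = len(rankings)
    n = len(rankings[0])
    if m < 2 or n < 2:
        raise ValueError("need at least two rankings of at least two items")
    if any(len(r) != n for r in rankings):
        raise ValueError("all rankings must cover the same items")

    # Total rank received by each item across all runs.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    # Expected rank sum if the runs were unrelated.
    mean_rank_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from that expectation.
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

Here each inner sequence would hold the ranks of the same LLMs under one seed, so a value near 1 means the seeds order the models almost identically.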
📝 Abstract
Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts, and two prompts are connected if their embedding-space distance exceeds a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings, even though these rankings may differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$). At higher percentile thresholds, the selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and on two benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
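The selection pipeline described above can be sketched as follows, under stated assumptions. The edge rule follows the abstract as written (connect two prompts when their distance exceeds a percentile threshold). Euclidean distance, the nearest-rank percentile, and the minimum-degree greedy heuristic with seeded tie-breaking are illustrative choices, and the names `build_threshold_graph` and `greedy_mis` are hypothetical. None of this is the authors' implementation, nor a stand-in for CPLEX, Online-MIS, or ReduMIS.

```python
import math
import random
from itertools import combinations


def build_threshold_graph(embeddings, percentile):
    """Connect prompts i and j when their distance exceeds the given
    percentile of all pairwise distances (edge rule as stated in the text)."""
    n = len(embeddings)
    adj = {i: set() for i in range(n)}
    if n < 2:
        return adj
    dists = {
        (i, j): math.dist(embeddings[i], embeddings[j])
        for i, j in combinations(range(n), 2)
    }
    ordered = sorted(dists.values())
    # Nearest-rank percentile (an assumption; other definitions exist).
    k = max(0, math.ceil(percentile / 100 * len(ordered)) - 1)
    threshold = ordered[k]
    for (i, j), d in dists.items():
        if d > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_mis(adj, seed=0):
    """Minimum-degree greedy heuristic for a maximal independent set.
    Ties are broken at random, so different seeds can give different subsets."""
    rng = random.Random(seed)
    remaining = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    selected = []
    while remaining:
        min_deg = min(len(nb) for nb in remaining.values())
        candidates = sorted(v for v, nb in remaining.items() if len(nb) == min_deg)
        v = rng.choice(candidates)
        selected.append(v)
        # Drop the chosen vertex and all of its neighbours from the graph.
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nb in remaining.values():
            nb -= removed
    return sorted(selected)
```

Under this edge rule a lower percentile connects more pairs, so the graph becomes denser and the independent set smaller, which matches the failure regime reported at $p_{10}$--$p_{20}$.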