Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating large language models (LLMs) on comprehensive benchmarks is computationally expensive. This work proposes the first systematic application of Maximum Independent Set (MIS) algorithms to prompt selection: it builds a similarity graph over prompt embeddings and selects a diverse, non-redundant subset of prompts. The authors evaluate four MIS solvers, six embedding models, three distance measures, and six percentile thresholds on four benchmarks covering 66 LLMs. In 99.2% of stochastic configurations, the LLM rankings obtained under different random seeds reach Kendall’s coefficient of concordance W ≥ 0.90 (mean: 0.997), and at higher percentile thresholds the selected subsets reduce prompt counts by 25–48% on average. Ranking divergence from the full benchmark (ρ < 0.95) occurs in only 15.95% of configurations, concentrated at low thresholds, which points to overly dense graphs as the primary failure mode.
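
To make the consistency metric concrete, here is a minimal sketch of Kendall's coefficient of concordance W for several rankings of the same LLMs, for example one ranking per random seed. It assumes complete rankings without ties and uses the textbook formula W = 12S / (m²(n³ − n)); the function name `kendalls_w` and the input layout are illustrative assumptions, not taken from the paper.

```python
def kendalls_w(rankings):
    """Kendall's W for m rankings of the same n items.

    Assumption: each ranking is a list of ranks 1..n with no ties,
    where rankings[k][i] is the rank that run k gives to item i.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each item received across all runs.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    # Expected rank sum per item if the runs were unrelated.
    mean_sum = m * (n + 1) / 2
    # S: squared deviation of the rank sums from that expectation.
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

W equals 1 when every run produces the same ranking and 0 when the rank sums are all equal, so a value such as W ≥ 0.90 indicates that the seeds largely agree on the ordering of the models.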

📝 Abstract

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and on two benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
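
The pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes cosine similarity as the pairwise measure, a nearest-rank percentile over all pairwise scores as the threshold, and an edge whenever a pair's score exceeds that threshold (a reading consistent with low thresholds producing dense graphs). The solver is a generic seeded min-degree greedy heuristic, which returns a maximal rather than provably maximum independent set and is not necessarily the paper's GREEDY solver. The names `build_graph` and `greedy_independent_set` are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_graph(embeddings, percentile):
    """Connect prompts whose similarity exceeds the given percentile (assumed rule)."""
    n = len(embeddings)
    sims = {(i, j): cosine_similarity(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)}
    ordered = sorted(sims.values())
    # Nearest-rank percentile over all pairwise similarities.
    k = max(0, min(len(ordered) - 1, math.ceil(percentile / 100 * len(ordered)) - 1))
    threshold = ordered[k]
    adj = {i: set() for i in range(n)}
    for (i, j), s in sims.items():
        if s > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_independent_set(adj, seed=0):
    """Seeded min-degree greedy heuristic: no two selected prompts share an edge."""
    rng = random.Random(seed)
    remaining = {v: set(nb) for v, nb in adj.items()}
    selected = []
    while remaining:
        min_deg = min(len(nb) for nb in remaining.values())
        candidates = sorted(v for v, nb in remaining.items() if len(nb) == min_deg)
        v = rng.choice(candidates)  # the seed only breaks ties
        selected.append(v)
        removed = remaining[v] | {v}  # drop the pick and its neighbours
        for u in removed:
            remaining.pop(u, None)
        for nb in remaining.values():
            nb -= removed
    return sorted(selected)
```

Lowering `percentile` adds edges, so the selected subset shrinks; running `greedy_independent_set` with several seeds yields the repeated selections whose resulting LLM rankings can then be compared for consistency.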

Problem

Research questions and friction points this paper is trying to address.

LLM benchmarking

prompt selection

evaluation efficiency

redundancy reduction

model ranking consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum Independent Set

Prompt Selection

Similarity Graph