Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

📅 2026-05-31

🤖 AI Summary
Evaluating large language models (LLMs) on comprehensive benchmarks is computationally expensive. This work proposes the first systematic application of Maximum Independent Set (MIS) algorithms to prompt selection: it builds a similarity graph from embedding-space distances and selects a diverse, non-redundant subset of prompts. The authors evaluate four MIS solvers, six embedding models, three distance metrics, and six percentile thresholds. In 99.2% of stochastic configurations, rankings obtained under different random seeds agree with Kendall’s coefficient of concordance W ≥ 0.90 (mean: 0.997), and at higher percentile thresholds the selected subsets use 25–48% fewer prompts on average. Ranking divergence from the full benchmark (ρ < 0.95) occurs in only 15.95% of configurations, predominantly at low thresholds, which points to overly dense graphs as the main failure mode.
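For readers unfamiliar with the agreement statistic quoted in the summary, here is a minimal sketch of Kendall's W in plain Python. It assumes each random seed yields a complete ranking of the same LLMs with no ties (ranks 1 to n); the function name `kendalls_w` and the input layout are illustrative choices, not taken from the paper.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m complete rankings of n items.

    rankings[k][i] is the rank (1..n, no ties) that run k assigns to item i.
    Returns a value in [0, 1]; 1 means all runs produce the same ranking.
    """
    m = len(rankings)          # number of runs (e.g. random seeds)
    n = len(rankings[0])       # number of ranked items (e.g. LLMs)
    # Sum of the ranks each item received across all runs.
    totals = [sum(run[i] for run in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # Squared deviations of the rank sums from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

With tied ranks a correction term is needed, which this sketch deliberately omits.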
📝 Abstract
Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph (nodes are prompts, connected if their embedding-space distance falls above a configurable threshold) and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis, that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline, is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and on two benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
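The pipeline described in the abstract (percentile threshold, graph construction, independent-set selection) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: Euclidean distance, a nearest-rank percentile, and a seeded minimum-degree greedy heuristic stand in for the paper's distance measures and solvers, and all function names are hypothetical. The edge rule follows the abstract's wording (an edge when the distance exceeds the threshold). The sketch expects at least two prompts.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def percentile_threshold(distances, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a list of distances."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def build_graph(embeddings, p):
    """Connect two prompts when their distance exceeds the p-th percentile.

    Assumption: the edge rule mirrors the abstract's wording.
    Returns an adjacency dict {node: set(neighbours)}.
    """
    n = len(embeddings)
    pair_dist = {
        (i, j): euclidean(embeddings[i], embeddings[j])
        for i in range(n) for j in range(i + 1, n)
    }
    threshold = percentile_threshold(list(pair_dist.values()), p)
    adjacency = {i: set() for i in range(n)}
    for (i, j), d in pair_dist.items():
        if d > threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

def greedy_independent_set(adjacency, seed=0):
    """Minimum-degree greedy heuristic with seeded random tie-breaking.

    Yields a maximal (not necessarily maximum) independent set.
    """
    rng = random.Random(seed)
    remaining = {v: set(nb) for v, nb in adjacency.items()}  # working copy
    selected = []
    while remaining:
        min_deg = min(len(nb) for nb in remaining.values())
        candidates = sorted(v for v, nb in remaining.items() if len(nb) == min_deg)
        v = rng.choice(candidates)
        selected.append(v)
        removed = remaining[v] | {v}      # chosen node plus its neighbours
        for u in removed:
            remaining.pop(u, None)
        for nb in remaining.values():
            nb -= removed                 # drop edges to removed nodes
    return sorted(selected)
```

Calling `greedy_independent_set(build_graph(embeddings, 50), seed=s)` for several values of `s` gives the kind of repeated, seed-dependent selections whose ranking agreement the abstract measures. A greedy heuristic only guarantees that no further prompt can be added to the subset; exact solvers are needed for a true maximum.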
Problem

Research questions and friction points this paper is trying to address.

LLM benchmarking
prompt selection
evaluation efficiency
redundancy reduction
model ranking consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum Independent Set
Prompt Selection
Similarity Graph
LLM Benchmarking
Efficiency Optimization
Denica Kjorvezir
Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
Marko Djukanović
Center for Astrophysics and Cosmology, University of Nova Gorica, Nova Gorica, Slovenia
Ana Gjorgjevikj
Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
Gjorgjina Cenikj
Young Researcher, Jožef Stefan Institute
machine learning, deep learning, optimization, automated machine learning, NLP
Tome Eftimov
Computer Systems Department, Jožef Stefan Institute
Statistics, Stochastic Optimization Algorithms, Machine learning, Natural Language Processing