Instance-level Randomization: Toward More Stable LLM Evaluations

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM evaluation suffers from score instability and model-ranking fluctuations caused by stochastic factors such as few-shot example selection, ordering, and content, making fixed-setting evaluations prone to unfair comparisons. Method: the paper proposes an instance-level randomized evaluation framework that samples the stochastic factors independently for every input instance (e.g., example order and composition) and averages the results of multiple independent runs to reduce variance. Contribution/Results: it provides the first theoretical analysis of the variance sources induced by such randomness in LLM evaluation and proves that the method reduces evaluation variance while cutting computational cost by more than half. Experiments show that the new paradigm significantly improves score stability and ranking robustness, outperforming conventional fixed-setting approaches under a substantially lower computational budget and thereby mitigating the risk of misjudging models due to stochasticity.

📝 Abstract
Evaluations of large language models (LLMs) suffer from instability, where small changes in random factors such as few-shot examples can lead to drastic fluctuations in scores and even in model rankings. Moreover, different LLMs can have different preferences for a given setting of random factors. As a result, using a fixed setting of random factors, which is the paradigm often adopted in current evaluations, can lead to potentially unfair comparisons between LLMs. To mitigate the volatility of evaluations, we first theoretically analyze the sources of variance induced by changes in random factors. Targeting these specific sources, we then propose the instance-level randomization (ILR) method to reduce variance and enhance fairness in model comparisons. Instead of using a fixed setting across the whole benchmark in a single experiment, we randomize all factors that affect evaluation scores for every single instance, run multiple experiments, and report the averaged score. Theoretical analyses and empirical results demonstrate that ILR can reduce the variance and unfair comparisons caused by random factors, while achieving a similar level of robustness at less than half the computational cost of previous methods.
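The variance-reduction effect of averaging independent randomized runs can be illustrated with a small Monte Carlo sketch. The score distribution below is synthetic and purely for illustration, not data from the paper; averaging k independent runs shrinks the standard deviation of the reported score by roughly 1/sqrt(k):

```python
import random
import statistics

def noisy_eval(true_score: float, noise_sd: float, rng: random.Random) -> float:
    """One evaluation run: the model's true score plus noise contributed
    by random factors (few-shot selection, ordering, etc.)."""
    return true_score + rng.gauss(0.0, noise_sd)

def averaged_eval(true_score: float, noise_sd: float, k: int,
                  rng: random.Random) -> float:
    """Average k independent randomized runs, as ILR prescribes."""
    return statistics.fmean(noisy_eval(true_score, noise_sd, rng)
                            for _ in range(k))

rng = random.Random(0)
single = [noisy_eval(70.0, 2.0, rng) for _ in range(2000)]
avg_k4 = [averaged_eval(70.0, 2.0, 4, rng) for _ in range(2000)]

# The spread of the 4-run average is roughly half that of a single run.
print(round(statistics.stdev(single), 2))
print(round(statistics.stdev(avg_k4), 2))
```

This is the classical variance argument behind reporting averaged scores: independent noise contributions cancel, so rankings based on the averaged score flip far less often.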
Problem

Research questions and friction points this paper is trying to address.

Addressing instability in LLM evaluations from random factors
Reducing variance and unfair comparisons between different LLMs
Proposing a method to enhance evaluation fairness and robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Randomizes factors per instance for variance reduction
Averages scores from multiple randomized experiments
Reduces computational cost while maintaining robustness
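The steps above can be sketched in a few lines. This is an illustrative reconstruction of the instance-level randomization idea, not the paper's implementation; the task format, scorer, and factor choices (few-shot composition and order) are assumptions:

```python
import random
import statistics

def evaluate_instance(model, instance, few_shot_pool,
                      rng: random.Random) -> float:
    """Score one instance with freshly randomized factors: ILR re-samples
    the few-shot examples and their order for every single instance."""
    examples = rng.sample(few_shot_pool, k=3)  # random composition
    rng.shuffle(examples)                      # random order
    prompt = "\n".join(examples) + "\n" + instance["input"]
    return float(model(prompt) == instance["target"])

def ilr_score(model, benchmark, few_shot_pool,
              runs: int = 3, seed: int = 0) -> float:
    """Run the whole benchmark several times with independent randomness
    per instance and report the averaged score."""
    rng = random.Random(seed)
    run_scores = [
        statistics.fmean(evaluate_instance(model, inst, few_shot_pool, rng)
                         for inst in benchmark)
        for _ in range(runs)
    ]
    return statistics.fmean(run_scores)
```

The contrast with the fixed-setting paradigm is that no single draw of random factors is shared across the benchmark, so no model can benefit from a setting that happens to favor it.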
👥 Authors
Yiyang Li
University of Michigan
Yonghuang Wu
Fudan University
Ying Luo
Meituan Group
Liangtai Sun
Master, Shanghai Jiao Tong University
NLP, GUI understanding, Multi-modal
Zishu Qin
Fudan University
Lin Qiu
Meituan Group
Xuezhi Cao
Meituan
Data Mining, Knowledge Graph, LLMs
Xunliang Cai
Meituan Group