LLM Stability: A detailed analysis with some surprises

📅 2024-08-06
🤖 AI Summary
Large language models (LLMs) exhibit non-deterministic outputs even under settings expected to be deterministic, yet the extent and impact of this instability had not previously been systematically assessed. This work conducts a cross-run stability analysis of five state-of-the-art LLMs on eight common NLP tasks, with each task executed 10 times per model in both zero-shot and few-shot settings. Both raw string matching and parsed-answer comparison are used to quantify output variability, via two proposed metrics: TARr@N (raw-output agreement across N runs) and TARa@N (parsed-answer agreement across N runs). The evaluation reveals accuracy fluctuations of up to 15 percentage points across runs, with a gap of up to 70 points between best-possible and worst-possible performance. No model achieves consistent outputs or accuracy across all tasks, and practitioner feedback suggests the non-determinism stems from efficiency-motivated co-mingling of data in input buffers rather than model architecture alone. All code and data are publicly released.

📝 Abstract
LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. Yet the questions of how pervasive this is, and with what impact on results, have not to our knowledge been systematically investigated. We investigate non-determinism in five LLMs configured to be deterministic when applied to eight common tasks across 10 runs, in both zero-shot and few-shot settings. We see accuracy variations up to 15% across naturally occurring runs, with a gap of best possible performance to worst possible performance up to 70%. In fact, none of the LLMs consistently delivers repeatable accuracy across all tasks, much less identical output strings. Sharing preliminary results with insiders has revealed that non-determinism is perhaps essential to the efficient use of compute resources via co-mingled data in input buffers, so this issue is not going away anytime soon. To better quantify our observations, we introduce metrics focused on quantifying determinism: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for the total agreement rate of parsed-out answers. Our code and data are publicly available at http://github.com/REDACTED.
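The abstract names TARr@N and TARa@N but does not spell out a formula here. A plausible reading, sketched below under that assumption, is the fraction of inputs for which all N runs produce identical outputs (raw strings for TARr@N, parsed answers for TARa@N); the function name and data layout are illustrative, not from the paper.

```python
from typing import List


def tar_at_n(runs: List[List[str]]) -> float:
    """Total agreement rate at N runs (assumed definition).

    `runs` holds N runs, each a list of per-example outputs:
    raw output strings for TARr@N, parsed-out answers for TARa@N.
    Returns the fraction of examples on which all N runs agree exactly.
    """
    n_examples = len(runs[0])
    assert all(len(r) == n_examples for r in runs), "runs must be aligned"
    # An example counts as agreeing only if every run produced the same output.
    agree = sum(1 for outputs in zip(*runs) if len(set(outputs)) == 1)
    return agree / n_examples


# Example: 3 runs over 4 inputs; all runs agree only on inputs 0 and 3.
runs = [
    ["yes", "no",  "maybe", "42"],
    ["yes", "yes", "maybe", "42"],
    ["yes", "no",  "no",    "42"],
]
print(tar_at_n(runs))  # 0.5
```

Applying the same function to raw strings versus parsed answers naturally yields TARr@N ≤ TARa@N, since distinct raw outputs can still parse to the same answer.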
Problem

Research questions and friction points this paper is trying to address.

Investigates pervasive non-determinism in LLM outputs across tasks
Measures accuracy variations up to 15% across runs, with best-to-worst performance gaps up to 70%
Proposes metrics (TARr@N, TARa@N) to quantify output determinism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing non-determinism in five LLMs
Introducing metrics TARr@N and TARa@N
Quantifying best-to-worst accuracy gaps of up to 70%
Berk Atil
Penn State University, Comcast AI Technologies
Alexa Chittams
Comcast AI Technologies
Lisheng Fu
Comcast AI Technologies
Ferhan Ture
NLP Research @ Comcast AI
Lixinyu Xu
Comcast AI Technologies
Breck Baldwin
Comcast AI Technologies