🤖 AI Summary
Quantifying the output variability of large language models (LLMs) under input perturbations or across model variants remains challenging: explicit modeling of the output distribution is intractable, and naive output comparisons are confounded by sampling stochasticity.
Method: This paper proposes a black-box robustness auditing framework that casts output-divergence detection as a statistical hypothesis test in a semantic embedding space (e.g., BERTScore, STS similarity). It constructs empirical null distributions via Monte Carlo sampling, bypassing explicit modeling of output distributions and mitigating randomness-induced bias.
Contribution/Results: The paper introduces distribution-based perturbation analysis, a model-agnostic paradigm enabling joint testing of multiple perturbations, interpretable p-values, scalar effect sizes, and integrated multiple-testing correction (e.g., Bonferroni). Experiments demonstrate accurate quantification of response shifts, reliable estimation of true/false positive rates, and cross-model consistency assessment, achieving markedly improved reliability and reproducibility in LLM robustness evaluation without distributional assumptions.
📝 Abstract
Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or a change of model variant. We cannot simply compare two LLM outputs, since they might differ due to the stochastic nature of the system, nor can we compare the entire output distributions due to computational intractability. While methods for analyzing text-based outputs exist, they target fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework (i) is model-agnostic; (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM; (iii) yields interpretable p-values; (iv) supports multiple perturbations with controlled error rates; and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Overall, we see this as a reliable frequentist hypothesis-testing framework for LLM auditing.
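To make the testing procedure concrete, the following is a minimal sketch of a Monte Carlo hypothesis test in a semantic similarity space, under stated assumptions: a toy `embed()` stands in for a real semantic encoder (e.g., a sentence-embedding model), each output is scored by cosine similarity to a reference embedding, and a permutation-style resampling builds the empirical null. The function name `perturbation_test` and all details here are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # Toy stand-in for a semantic encoder; here each "text" is
    # already a numeric vector. A real audit would embed strings.
    return np.asarray(texts, dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def perturbation_test(baseline, perturbed, references, n_mc=10_000):
    """Monte Carlo test: do perturbed outputs drift from baseline
    outputs in semantic-similarity space?  Each output is reduced to
    its similarity to a fixed reference embedding; group labels are
    then permuted to build an empirical null for the difference in
    mean similarity."""
    ref = embed(references).mean(axis=0)
    s_base = np.array([cosine(e, ref) for e in embed(baseline)])
    s_pert = np.array([cosine(e, ref) for e in embed(perturbed)])
    observed = s_base.mean() - s_pert.mean()

    pooled = np.concatenate([s_base, s_pert])
    n = len(s_base)
    null = np.empty(n_mc)
    for i in range(n_mc):
        rng.shuffle(pooled)  # relabel outputs under the null
        null[i] = pooled[:n].mean() - pooled[n:].mean()

    # Two-sided Monte Carlo p-value (+1 correction) and a
    # Cohen's-d-style scalar effect size.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_mc)
    d = observed / (pooled.std(ddof=1) + 1e-12)
    return p, d
```

With several perturbations tested jointly, each p-value would then be compared against a corrected threshold (e.g., alpha divided by the number of tests for Bonferroni).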