🤖 AI Summary
Quantifying the output variability of large language models (LLMs) under input perturbations or across model variants remains challenging: explicit modeling of the output distribution is intractable, and naive output comparisons are confounded by sampling stochasticity.
Method: This paper proposes a black-box robustness auditing framework that casts output-divergence detection as a statistical hypothesis test in a semantic embedding space (e.g., BERTScore, STS similarity). It constructs empirical null distributions via Monte Carlo sampling, bypassing explicit modeling of output distributions and mitigating randomness-induced bias.
Contribution/Results: The paper introduces distribution-based perturbation analysis, a model-agnostic paradigm enabling joint testing of multiple perturbations, interpretable p-values, scalar effect sizes, and integrated multiple-testing correction (e.g., Bonferroni). Experiments demonstrate accurate quantification of response shifts, reliable estimation of true/false positive rates, and cross-model consistency assessment, achieving markedly improved reliability and reproducibility in LLM robustness evaluation without distributional assumptions.
📝 Abstract
Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or a change of model variant. We cannot simply compare two LLM outputs, since they might differ due to the stochastic nature of the system, nor can we compare the entire output distributions due to computational intractability. While methods for analyzing text-based outputs exist, they target fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework (i) is model-agnostic; (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM; (iii) yields interpretable p-values; (iv) supports multiple perturbations with controlled error rates; and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Overall, we see this as a reliable frequentist hypothesis-testing framework for LLM auditing.
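To make the testing procedure concrete, the following is a minimal sketch of a Monte Carlo hypothesis test in a semantic similarity space, under stated assumptions: a toy `embed()` stands in for a real semantic encoder (e.g., a sentence-embedding model), each output is scored by cosine similarity to a reference embedding, and a permutation-style resampling builds the empirical null. The function name `perturbation_test` and all details here are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # Toy stand-in for a semantic encoder; here each "text" is
    # already a numeric vector. A real audit would embed strings.
    return np.asarray(texts, dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def perturbation_test(baseline, perturbed, references, n_mc=10_000):
    """Monte Carlo test: do perturbed outputs drift from baseline
    outputs in semantic-similarity space?  Each output is reduced to
    its similarity to a fixed reference embedding; group labels are
    then permuted to build an empirical null for the difference in
    mean similarity."""
    ref = embed(references).mean(axis=0)
    s_base = np.array([cosine(e, ref) for e in embed(baseline)])
    s_pert = np.array([cosine(e, ref) for e in embed(perturbed)])
    observed = s_base.mean() - s_pert.mean()

    pooled = np.concatenate([s_base, s_pert])
    n = len(s_base)
    null = np.empty(n_mc)
    for i in range(n_mc):
        rng.shuffle(pooled)  # relabel outputs under the null
        null[i] = pooled[:n].mean() - pooled[n:].mean()

    # Two-sided Monte Carlo p-value (+1 correction) and a
    # Cohen's-d-style scalar effect size.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_mc)
    d = observed / (pooled.std(ddof=1) + 1e-12)
    return p, d
```

With several perturbations tested jointly, each p-value would then be compared against a corrected threshold (e.g., alpha divided by the number of tests for Bonferroni).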