Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety evaluations are hard to trust: they often compare models that are not meaningfully comparable, rely on heuristically designed prompts, and use metrics that ignore output uncertainty. To address this, the paper proposes an end-to-end Bayesian framework for evaluating vulnerability to prompt injection attacks, combining controlled experimental design, embedding-space clustering, and Bayesian hierarchical modelling to jointly account for output stochasticity, imperfectly designed test prompts, and limited evaluation compute. The framework supports quantitative security comparisons in two practitioner scenarios, training an LLM and deploying a pre-trained one, and improves uncertainty-aware risk inference across diverse prompt injection settings. Applied to Transformer versus Mamba architectures, it shows that accounting for output variability can weaken seemingly definitive conclusions, while still revealing notably increased vulnerabilities for some attacks across models trained on the same data.
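
As a rough illustration of the modelling idea described above, the sketch below fits a partially pooled Beta-Binomial model over prompt clusters with PyMC. The data, cluster counts, priors, and variable names are hypothetical stand-ins; the paper's actual model specification may differ.

```python
import numpy as np
import pymc as pm

# Toy setup: 12 test prompts in 3 embedding clusters, 20 stochastic
# queries per prompt. All numbers are illustrative, not from the paper.
n_trials = 20
cluster_idx = np.repeat([0, 1, 2], 4)          # cluster label per prompt
true_rates = np.array([0.1, 0.4, 0.7])[cluster_idx]
successes = np.random.default_rng(0).binomial(n_trials, true_rates)

with pm.Model() as model:
    # Global attack-success rate and pooling strength shared by clusters.
    mu = pm.Beta("mu", alpha=1.0, beta=1.0)
    kappa = pm.HalfNormal("kappa", sigma=10.0)
    # Cluster-level rates, partially pooled toward the global mean.
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1.0 - mu) * kappa, shape=3)
    # Observed success counts for each prompt, sharing its cluster's rate.
    pm.Binomial("y", n=n_trials, p=theta[cluster_idx], observed=successes)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means per cluster; full uncertainty is retained in `idata`.
print(idata.posterior["theta"].mean(dim=("chain", "draw")).values)
```

Partial pooling is what lets the model cope with limited compute: prompts with few evaluation runs borrow strength from their cluster rather than being estimated in isolation.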

📝 Abstract
Before adopting a new large language model (LLM) architecture, it is critical to understand vulnerabilities accurately. Existing evaluations can be difficult to trust, often drawing conclusions from LLMs that are not meaningfully comparable, relying on heuristic inputs or employing metrics that fail to capture the inherent uncertainty. In this paper, we propose a principled and practical end-to-end framework for evaluating LLM vulnerabilities to prompt injection attacks. First, we propose practical approaches to experimental design, tackling unfair LLM comparisons by considering two practitioner scenarios: when training an LLM and when deploying a pre-trained LLM. Second, we address the analysis of experiments and propose a Bayesian hierarchical model with embedding-space clustering. This model is designed to improve uncertainty quantification in the common scenario that LLM outputs are not deterministic, test prompts are designed imperfectly, and practitioners only have a limited amount of compute to evaluate vulnerabilities. We show the improved inferential capabilities of the model in several prompt injection attack settings. Finally, we demonstrate the pipeline to evaluate the security of Transformer versus Mamba architectures. Our findings show that consideration of output variability can suggest less definitive findings. However, for some attacks, we find notably increased Transformer and Mamba-variant vulnerabilities across LLMs with the same training data or mathematical ability.
Problem

Research questions and friction points this paper is trying to address.

- Evaluating LLM vulnerabilities to prompt injection attacks reliably
- Addressing unfair comparisons in LLM security testing scenarios
- Improving uncertainty quantification for non-deterministic LLM outputs (see the sketch after this list)
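
To make the uncertainty-quantification point concrete, this minimal sketch computes a Bayesian credible interval for a single prompt's attack-success probability from repeated stochastic queries. The counts and the uniform prior are illustrative assumptions, not the paper's method.

```python
from scipy import stats

# Suppose an injection succeeds in 7 of 25 repeated queries to a stochastic
# LLM (temperature > 0); the counts here are made up for illustration.
k, n = 7, 25

# A uniform Beta(1, 1) prior on the success probability gives a
# Beta(1 + k, 1 + n - k) posterior; report an interval, not just k/n.
posterior = stats.beta(1 + k, 1 + n - k)
low, high = posterior.ppf([0.025, 0.975])
print(f"point estimate {k / n:.2f}, 95% credible interval [{low:.2f}, {high:.2f}]")
```

Reporting the interval rather than the raw rate k/n is what prevents a handful of lucky or unlucky samples from being read as a definitive vulnerability finding.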
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Bayesian hierarchical model for uncertainty quantification
- Embedding-space clustering to analyze LLM outputs (see the sketch after this list)
- Practical experimental design for LLM vulnerability evaluation
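
The clustering step might look roughly like the following, which groups prompt (or output) embeddings with scikit-learn's KMeans so that cluster labels can feed the hierarchical model above. The embedding source, dimensionality, and cluster count are all assumed for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in embeddings: one vector per test prompt, as produced by any
# sentence-embedding model (random values used here for illustration).
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(200, 384))       # 200 prompts, 384 dims

# Cluster semantically similar prompts so the hierarchical model pools
# information within clusters instead of treating prompts as independent.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
cluster_idx = kmeans.fit_predict(embeddings)
print(np.bincount(cluster_idx))                # prompts per cluster
```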
👥 Authors
Mary Llewellyn (The Alan Turing Institute)
Annie Gray (The Alan Turing Institute)
Josh Collyer (PhD Student, Loughborough University; interests: cyber, machine learning, deep learning, binary analysis, information retrieval)
Michael Harries (The Alan Turing Institute)