What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

📅 2024-06-18
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
🤖 AI Summary
Large language models (LLMs) produce unstable outputs in software applications when prompts undergo minor rephrasings, which hinders reliable deployment. Method: The paper introduces two quantifiable metrics that decouple robustness from task performance: sensitivity, which measures prediction changes across rephrasings of the prompt and requires no ground-truth labels, and consistency, which measures how predictions vary across rephrasings for elements of the same class. On text classification tasks, the authors perform systematic, multi-round prompt rewriting and statistically analyze the resulting prediction distributions. Contribution/Results: The empirical evaluation shows that the evaluated LLMs frequently exhibit high sensitivity and low consistency, exposing a robustness gap. The framework provides a reproducible, ground-truth-free diagnostic for prompt engineering, supporting the joint optimization of accuracy and robustness and a measurement-driven approach to assessing LLM resilience against prompt variations.

📝 Abstract
Large Language Models (LLMs) have changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers who want to include these models in their software stack, however, face a dreadful challenge: debugging LLMs' inconsistent behavior across minor variations of the prompt. We therefore introduce two metrics for classification tasks, namely sensitivity and consistency, which are complementary to task performance. First, sensitivity measures changes of predictions across rephrasings of the prompt, and does not require access to ground truth labels. Second, consistency measures how predictions vary across rephrasings for elements of the same class. We perform an empirical comparison of these metrics on text classification tasks, using them as a guideline for understanding failure modes of the LLM. Our hope is that sensitivity and consistency will help guide prompt engineering and obtain LLMs that balance robustness with performance.
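To make the two metrics concrete, below is a minimal sketch of how one might estimate them for a classification task. It assumes simple agreement-based definitions (sensitivity as the average per-input disagreement of predictions across prompt rephrasings, consistency as the within-class agreement of predictions across rephrasings); the exact formulations in the paper may differ.

```python
from collections import Counter

def sensitivity(predictions_per_input):
    """Illustrative sensitivity: average disagreement of predictions across
    prompt rephrasings for each input. Needs no ground-truth labels.

    predictions_per_input: list of lists, one inner list per input holding the
    predicted label under each rephrasing of the prompt.
    """
    scores = []
    for preds in predictions_per_input:
        # Fraction of predictions that deviate from the most frequent one.
        majority = Counter(preds).most_common(1)[0][1]
        scores.append(1.0 - majority / len(preds))
    return sum(scores) / len(scores)

def consistency(predictions_per_input, classes):
    """Illustrative consistency: within each class, how stable predictions are
    across rephrasings, averaged over classes. Uses class membership only.
    """
    by_class = {}
    for preds, cls in zip(predictions_per_input, classes):
        by_class.setdefault(cls, []).extend(preds)
    per_class = []
    for preds in by_class.values():
        majority = Counter(preds).most_common(1)[0][1]
        per_class.append(majority / len(preds))
    return sum(per_class) / len(per_class)

# Toy example: 3 inputs, predictions under 4 prompt rephrasings each.
preds = [["pos", "pos", "neg", "pos"],
         ["neg", "neg", "neg", "neg"],
         ["pos", "neg", "pos", "neg"]]
labels = ["pos", "neg", "pos"]
print(f"sensitivity: {sensitivity(preds):.2f}")   # 0.25
print(f"consistency: {consistency(preds, labels):.2f}")  # 0.81
```

In this toy run, the first metric uses no labels at all, while the second only uses class membership to group predictions, mirroring the roles the abstract assigns to sensitivity and consistency.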
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Semantic Sensitivity
Prediction Instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sensitivity
Consistency
Large Language Model Evaluation
Federico Errica
NEC Italia and NEC Laboratories Europe
G. Siracusano
NEC Italia and NEC Laboratories Europe
D. Sanvito
NEC Italia and NEC Laboratories Europe
Roberto Bifulco
NEC Italia and NEC Laboratories Europe