Are Humans as Brittle as Large Language Models?

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether human annotators exhibit sensitivity to minor prompt perturbations comparable to that of large language models (LLMs), in order to determine whether LLM prompt fragility stems from model deficiencies or reflects inherent variability in human judgment. Using a controlled experimental design in text classification, the authors apply four types of prompt perturbations—label set substitution, formatting changes, typographical errors, and label order reversal—to both human annotators and LLMs. Results show that both humans and LLMs are sensitive to label set and formatting changes; however, humans are more robust to typographical errors and label order permutations. This work provides the first systematic evidence that prompt brittleness is not exclusive to LLMs but partially arises from intrinsic annotator variance. It also shows that current LLMs lag behind humans in semantic robustness, highlighting a gap in their alignment with human judgment under linguistic perturbations.
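The four perturbation types described above can be illustrated with a small sketch. Note that the base prompt, label sets, and function below are hypothetical illustrations, not the paper's actual experimental materials:

```python
# Sketch of the four prompt perturbations studied in the paper:
# label set substitution, formatting change, typographical error,
# and label order reversal. All prompt wording and labels here are
# assumptions for illustration only.

BASE_LABELS = ["positive", "negative", "neutral"]
ALT_LABELS = ["good", "bad", "okay"]  # label set substitution

def build_prompt(text, labels, fmt="comma", typo=False, reverse=False):
    """Build a classification prompt under one perturbation condition."""
    labels = list(reversed(labels)) if reverse else list(labels)
    if fmt == "comma":
        label_str = ", ".join(labels)
    else:  # "newline": formatting change, labels as a bulleted list
        label_str = "\n".join(f"- {label}" for label in labels)
    instruction = "Classify the sentiment of the text."
    if typo:  # typographical error in the instruction
        instruction = instruction.replace("Classify", "Clasify")
    return f"{instruction}\nLabels: {label_str}\nText: {text}\nAnswer:"

# Each condition yields a variant of the same underlying task, which can
# then be shown to both human annotators and LLMs.
baseline = build_prompt("Great movie!", BASE_LABELS)
reversed_order = build_prompt("Great movie!", BASE_LABELS, reverse=True)
with_typo = build_prompt("Great movie!", BASE_LABELS, typo=True)
substituted = build_prompt("Great movie!", ALT_LABELS)
```

Comparing output distributions across such variants, for humans and LLMs alike, is the core of the study's design.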

📝 Abstract
The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variance. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs with identical instruction modifications for human annotators, focusing on whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs on a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Comparing human and LLM sensitivity to prompt modifications
Investigating if prompt brittleness reflects human annotation variance
Assessing brittleness differences in label and typographical changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically comparing prompt modifications on humans and LLMs
Using text classification tasks with varied prompt conditions
Analyzing brittleness to label set and format changes
Jiahui Li
Fundamentals of Natural Language Processing, University of Bamberg, Germany
Sean Papay
Fundamentals of Natural Language Processing, University of Bamberg, Germany
Roman Klinger
Professor for Fundamentals of Natural Language Processing, University of Bamberg
natural language processing, emotion analysis, bioNLP, argument mining, computational psychology