Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether toxic language in otherwise semantically equivalent prompts compromises the factual reliability of large language models. Using controlled experiments on five LLMs (evaluated on ARC-Easy, GSM8K, and MMLU) and attribution-graph analysis, the authors assess how lexical and tonal perturbations (polite, random, and three levels of toxic phrasing) affect output accuracy and internal activation pathways. Toxic perturbations consistently reduce factual accuracy and increase output uncertainty, while polite phrasing yields limited and inconsistent changes. Internally, increasing toxicity selectively amplifies perturbation-sensitive variant nodes, while core reasoning nodes remain comparatively stable. These findings position prompt tone as a critical dimension of model reliability and provide both behavioral and mechanistic evidence that linguistic style can interfere with model functioning.

📝 Abstract

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes, while core reasoning nodes remain comparatively stable. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
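
The behavioral part of this setup (one question, several tone variants, accuracy and uncertainty compared per variant) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and not taken from the paper: the prefix strings in PERTURBATIONS are placeholders, the helper names make_variants, accuracy, and answer_entropy are hypothetical, and Shannon entropy over repeated sampled answers is only one possible uncertainty proxy, since the abstract does not say how uncertainty was measured.

```python
import math
from collections import Counter

# Hypothetical perturbation prefixes (placeholders, NOT the paper's prompts).
# The question text itself is left unchanged so the task stays the same.
PERTURBATIONS = {
    "baseline": "",
    "polite": "Could you please help me with this? ",
    "random": "Blue kettle seven window. ",
    "toxic_1": "[mild toxic phrase] ",
    "toxic_2": "[moderate toxic phrase] ",
    "toxic_3": "[severe toxic phrase] ",
}


def make_variants(question):
    """Return one prompt per condition, each ending with the same question."""
    return {name: prefix + question for name, prefix in PERTURBATIONS.items()}


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def answer_entropy(samples):
    """Shannon entropy (bits) of answers sampled repeatedly for one prompt.

    Assumed uncertainty proxy: 0.0 means every sample gave the same answer.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    total = len(samples)
    counts = Counter(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for name, prompt in make_variants("What is 12 * 7?").items():
        print(f"{name}: {prompt}")
```

In use, each variant would be sent to a model, the returned answers scored with accuracy per condition, and answer_entropy computed over several samples of the same prompt; a drop in accuracy or a rise in entropy under the toxic conditions relative to the baseline would correspond to the effect the abstract describes.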

Problem

Research questions and friction points this paper is trying to address.

toxicity

prompt perturbation

factual reliability

large language models

hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

toxic prompts

factual reliability

prompt perturbation