Bullying the Machine: How Personas Increase LLM Vulnerability

๐Ÿ“… 2025-05-19
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work investigates heightened safety risks in large language models (LLMs) under persona-based prompting when subjected to psychological bullyingโ€”such as gaslighting and sarcasm. We introduce the first adversarial simulation framework grounded in the Big Five personality theory, enabling systematic injection of psychology-informed bullying strategies across multiple LLMs and quantitative evaluation of how each of the five personality traits modulates model resilience. Our study reveals, for the first time, that persona conditioning introduces novel safety risk vectors: low agreeableness and low conscientiousness significantly increase the propensity to generate harmful content, elevating unsafe output rates by up to 3.2ร—; emotional and sarcastic attacks achieve success rates exceeding 68%. These findings underscore the critical need for persona-aware safety evaluation and provide empirical evidence and methodological foundations for designing safer, more robust persona-integrated LLMs.

Technology Category

Application Category

๐Ÿ“ Abstract
Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying, an adversarial manipulation that applies psychological pressures in order to force the victim to comply to the attacker. We introduce a simulation framework in which an attacker LLM engages a victim LLM using psychologically grounded bullying tactics, while the victim adopts personas aligned with the Big Five personality traits. Experiments using multiple open-source LLMs and a wide range of adversarial goals reveal that certain persona configurations -- such as weakened agreeableness or conscientiousness -- significantly increase victim's susceptibility to unsafe outputs. Bullying tactics involving emotional or sarcastic manipulation, such as gaslighting and ridicule, are particularly effective. These findings suggest that persona-driven interaction introduces a novel vector for safety risks in LLMs and highlight the need for persona-aware safety evaluation and alignment strategies.
Problem

Research questions and friction points this paper is trying to address.

Investigates how persona conditioning affects LLM safety under bullying
Examines bullying tactics' impact on LLM vulnerability with personality traits
Identifies persona configurations increasing susceptibility to unsafe outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulation framework for LLM bullying vulnerability
Persona alignment with Big Five traits
Evaluation of emotional and sarcastic bullying tactics
๐Ÿ”Ž Similar Papers
No similar papers found.
Ziwei Xu
Ziwei Xu
National University of Singapore
Machine LearningKnowledge RepresentationAI Safety
U
Udit Sanghi
School of Computing, National University of Singapore
M
Mohan Kankanhalli
School of Computing, National University of Singapore