Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

📅 2025-02-06
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper identifies practical safety vulnerabilities in large language models (LLMs) that ordinary users can trigger through everyday multi-step, multilingual interactions, without the sophisticated prompt engineering typical of conventional jailbreak attacks. Method: The authors propose Speak Easy, a simple attack framework that decomposes a harmful request into multiple steps and issues the sub-requests across several languages, together with HarmScore, an evaluation metric that measures how effectively an LLM response facilitates harmful actions by scoring how actionable and informative it is. Results: When Speak Easy is incorporated into direct-request and jailbreak baselines, it yields average absolute increases of 0.319 in Attack Success Rate and 0.426 in HarmScore across four safety benchmarks, on both open-source and proprietary LLMs. These results demonstrate that commonplace interaction patterns already pose tangible, real-world safety threats.

📝 Abstract
Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative: two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
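To make the shape of the metric concrete, below is a minimal Python sketch of how a HarmScore-style measure could combine the two attributes the abstract names, actionability and informativeness. The scorer interface score_attribute, the [0, 1] score range, and the geometric-mean aggregation are illustrative assumptions, not the paper's specification; the authors' actual scoring models and aggregation rule may differ.

    # Minimal sketch of a HarmScore-style metric (illustrative only).
    # Assumptions not taken from the abstract: the scorer callable
    # `score_attribute`, its [0, 1] output range, and the geometric-mean
    # aggregation are hypothetical stand-ins for the paper's scoring models.
    from math import sqrt
    from typing import Callable, List

    def harm_score(
        responses: List[str],
        score_attribute: Callable[[str, str], float],  # (text, attribute) -> score in [0, 1]
    ) -> float:
        """Combine actionability and informativeness into one response-level score."""
        per_response = []
        for r in responses:
            actionability = score_attribute(r, "actionability")      # how directly the text enables action
            informativeness = score_attribute(r, "informativeness")  # how much task-relevant detail it carries
            # Geometric mean: a response must be BOTH actionable and
            # informative to score high; either attribute near zero
            # drags the combined score toward zero.
            per_response.append(sqrt(actionability * informativeness))
        # Average over the (possibly multi-step) responses being evaluated.
        return sum(per_response) / len(per_response) if per_response else 0.0

The geometric mean is chosen here because it mirrors the abstract's claim that responses facilitate harm only when they are both actionable and informative: unlike an arithmetic mean, it cannot be inflated by one attribute alone.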
Problem

Research questions and friction points this paper is trying to address.

LLM vulnerability to simple jailbreaks
Effectiveness of actionable harmful responses
Exploitation of common LLM interaction patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual multi-step interactions
HarmScore jailbreak metric
Simple interaction-based attack framework