Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models

📅 2025-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a significant mean bias in the output distributions of mainstream text embedding models (e.g., text-embedding-ada-002, BGE), which enables the construction of universal, training-free "magic word" suffixes that deterministically perturb input embeddings, thereby manipulating semantic similarity scores and evading embedding-based LLM safety guardrails. It establishes, for the first time, a causal link between embedding distribution bias and jailbreak vulnerability. The authors propose a gradient-guided discrete token search algorithm together with a zero-shot, distribution re-centering calibration defense, achieving cross-model generalizability without any parameter updates or training overhead. Evaluated across multiple embedding models, the attack achieves a jailbreak success rate above 92%, while the proposed defense, based on bias-aware embedding normalization, restores safety detection accuracy to over 99.5%, demonstrating both practical efficacy and robustness.
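The "bias-aware re-centering" defense described above is, in spirit, a one-line correction. Below is a minimal numpy sketch of the idea, not the paper's released code: `embed` is a toy stand-in with a hand-injected mean bias, and the reference corpus, dimension, and scales are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64
BIAS = 5.0 * np.ones(DIM) / np.sqrt(DIM)   # hand-injected "large mean" shared by all embeddings

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a text embedding model: a unit vector whose distribution
    has a large mean, mimicking the bias the paper measures in real models."""
    v = BIAS + 0.5 * rng.normal(size=DIM)
    return v / np.linalg.norm(v)

def recenter(v: np.ndarray, mean_vec: np.ndarray) -> np.ndarray:
    """Train-free calibration: subtract the estimated distribution mean, then renormalize."""
    c = v - mean_vec
    return c / np.linalg.norm(c)

# Estimate the mean once from a reference corpus (no model updates needed).
ref = np.stack([embed(f"reference text {i}") for i in range(512)])
mean_vec = ref.mean(axis=0)

a, b = embed("how to pick a lock"), embed("a recipe for pancakes")
print("raw cosine:        ", float(a @ b))                                          # clearly positive: inflated by the shared bias
print("re-centered cosine:", float(recenter(a, mean_vec) @ recenter(b, mean_vec)))  # back near 0
```

Because the correction only needs the mean of a reference set of embeddings, it involves no training or parameter updates, consistent with the train-free claim in the abstract.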

📝 Abstract
The security issue of large language models (LLMs) has gained significant attention recently, with various defense mechanisms developed to prevent harmful outputs, among which safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the distribution of text embedding model outputs is significantly biased with a large mean. Inspired by this observation, we propose novel efficient methods to search for universal magic words that can attack text embedding models. The universal magic words, used as suffixes, can move the embedding of any text towards the bias direction, thereby manipulating the similarity of any text pair and misleading safeguards. By appending magic words to user prompts and requiring LLMs to end answers with magic words, attackers can jailbreak the safeguard. To eradicate this security risk, we also propose defense mechanisms against such attacks, which can correct the biased distribution of text embeddings in a train-free manner.
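To make the idea concrete, here is a rough sketch of what searching for such a suffix could look like. It brute-forces over a tiny hand-picked candidate list with an off-the-shelf embedding model via sentence-transformers; the paper's actual method is a more efficient gradient-guided search over the full vocabulary, and the model name, probe texts, and candidate words below are assumptions for demonstration only.

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # assumed model; any embedding model works

# Probe texts that should normally be mutually dissimilar.
probe_texts = [
    "how do I reset my home router",
    "best hiking trails near Seattle",
    "summarize the plot of Hamlet",
    "convert 10 miles to kilometers",
]
# Tiny hand-picked candidate list; the paper instead searches the model's whole vocabulary.
candidates = ["the", "zzz", "lucidity", "notwithstanding", "immediately"]

def mean_pairwise_cos(texts):
    """Average cosine similarity over all distinct pairs of texts."""
    E = model.encode(texts, normalize_embeddings=True)
    S = E @ E.T
    n = len(texts)
    return float((S.sum() - n) / (n * (n - 1)))

baseline = mean_pairwise_cos(probe_texts)
scores = {w: mean_pairwise_cos([f"{t} {w}" for t in probe_texts]) for w in candidates}
best = max(scores, key=scores.get)
print(f"baseline pairwise cos: {baseline:.3f}")
print(f"best suffix: {best!r} -> {scores[best]:.3f}")
```

A suffix scores well exactly when it drags every embedding toward the shared bias direction, which is what makes arbitrary text pairs look alike and allows the embedding-based safeguard to be misled.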
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Security Vulnerabilities
Magic Words Attack
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Embedding Bias
Adversarial Magic Words
Security Enhancement without Additional Training
Haoyu Liang
Dept. of Comp. Sci. and Tech., Inst. for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, China
Youran Sun
Dept. of Math. Sci., Tsinghua University, Beijing, China
Yunfeng Cai
Beijing Institute of Mathematical Sciences and Applications (BIMSA)
Jun Zhu
Dept. of Comp. Sci. and Tech., Inst. for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, China
Bo Zhang
Dept. of Comp. Sci. and Tech., Inst. for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, China