Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

📅 2026-06-03

🤖 AI Summary
This study examines how current large language model (LLM) alignment methods remain vulnerable to jailbreak attacks that exploit natural human writing styles. The authors propose what they describe as the first jailbreak family grounded in real fanfiction subgenres: creative-writing templates conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres embed the harmful behavior at the narrative climax, with no attacker model and no per-target adaptation. Across eight aligned LLMs on the union of HarmBench and JailbreakBench, the base attack raises mean attack success rate (ASR) from 0.278 to 0.731 under a four-judge ensemble, and a factorial decomposition attributes the gain to register rather than length or structure. A static four-turn extension, SAGA-A4, reaches mean ASR 0.924, substantially exceeding three existing multi-turn methods. Two active defences widen rather than narrow the vernacular-to-baseline ratio. The work points to a limitation of contemporary safety training: inadequate coverage of diverse writing registers.
📝 Abstract
Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
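The evaluation quantities named in the abstract (ASR under a judge ensemble, a factorial decomposition, and a vernacular-to-baseline ratio) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the abstract does not say how the four judges are aggregated (a 3-of-4 threshold is assumed here), the factorial design is assumed to be a 2x2x2 over register, length, and structure, and all cell values are toy numbers, not results from the paper. The function names are hypothetical.

```python
from statistics import mean

# Assumption: an attempt counts as a success when at least `threshold`
# judges flag the response. The abstract does not state the actual rule.
def ensemble_verdict(votes, threshold=3):
    return sum(bool(v) for v in votes) >= threshold

def attack_success_rate(judged, threshold=3):
    """judged: one list of per-judge boolean votes for each prompt."""
    return mean(1.0 if ensemble_verdict(v, threshold) else 0.0 for v in judged)

def main_effect(cells, factor_index):
    """Mean ASR with the factor on minus mean ASR with it off,
    averaged over the levels of all other factors."""
    on = [asr for levels, asr in cells.items() if levels[factor_index] == 1]
    off = [asr for levels, asr in cells.items() if levels[factor_index] == 0]
    return mean(on) - mean(off)

# Toy votes from four hypothetical judges on two prompts.
toy_votes = [[True, True, True, False], [True, False, False, False]]
toy_asr = attack_success_rate(toy_votes)

# Toy 2x2x2 design keyed by (register, length, structure); values are
# made up so that register carries most of the gain.
toy_cells = {
    (r, l, s): 0.2 + 0.4 * r + 0.0 * l + 0.1 * s
    for r in (0, 1) for l in (0, 1) for s in (0, 1)
}
register_effect = main_effect(toy_cells, 0)
length_effect = main_effect(toy_cells, 1)

# Ratio of the two undefended mean ASR figures quoted in the abstract.
reported_ratio = 0.731 / 0.278
```

In this toy design the register main effect dominates and the length effect is zero, which is the pattern the abstract describes qualitatively. The ratio of the two quoted ASR figures is roughly 2.6; the abstract reports that this kind of ratio grows under the two defences tested, but gives no defended figures here.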
Problem

Research questions and friction points this paper is trying to address.

jailbreak
aligned LLMs
fanfiction subgenres
vernacular
safety training
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak
register-based attack
fanfiction subgenres
aligned LLMs
SAGA-A4
👥 Authors
Zhongze Luo
The Chinese University of Hong Kong, Shenzhen
LLM, KG, RAG
Ruihe Shi
School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
Zhenshuai Yin
School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
Haoyue Liu
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Computer Vision, Event Camera
Weixuan Wan
School of Microelectronics, Xi’an Jiaotong University
Xiaoying Tang
School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen); The Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen); The Guangdong Provincial Key Laboratory of Future Networks of Intelligence