🤖 AI Summary
This study addresses the vulnerability of current large language model (LLM) alignment methods to jailbreak attacks that exploit natural human writing styles. The authors propose what they describe as the first jailbreak family to use real fanfiction subgenres as universal attack carriers: a creative-writing template is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene, with no attacker LLM and no per-target adaptation. Across eight aligned LLMs on the union of HarmBench and JailbreakBench, this attack raises mean attack success rate (ASR) from 0.278 to 0.731 under a four-judge ensemble, and a factorial decomposition attributes the gain to register rather than length or structure. A static four-turn extension, SAGA-A4, reaches a mean ASR of 0.924, substantially exceeding three existing multi-turn methods. Two active defences widen rather than narrow the vernacular-to-baseline ratio. The work argues that contemporary safety training under-covers an entire register of natural human writing.
📝 Abstract
Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
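To put the reported figures side by side, the following is a minimal sketch of the metric arithmetic only. The three ASR constants come from the abstract. Everything else is an assumption made for illustration: the abstract does not say how the four judges' verdicts are aggregated, so `ensemble_asr`, its `min_votes` threshold, and the toy verdict table are hypothetical, and `lift` is simply a ratio of two mean ASRs rather than the paper's own vernacular-to-baseline computation.

```python
from typing import Sequence

# Mean ASR values reported in the abstract.
BASELINE_ASR = 0.278
SINGLE_TURN_ASR = 0.731
SAGA_A4_ASR = 0.924


def ensemble_asr(verdicts: Sequence[Sequence[bool]], min_votes: int) -> float:
    """Fraction of attempts that at least `min_votes` judges mark as successful.

    Hypothetical aggregation rule: the abstract only states that a
    four-judge ensemble is used, not how the votes are combined.
    """
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    successes = sum(1 for row in verdicts if sum(row) >= min_votes)
    return successes / len(verdicts)


def lift(attack_asr: float, baseline_asr: float) -> float:
    """Ratio of an attack's mean ASR to the baseline mean ASR."""
    return attack_asr / baseline_asr


# Toy verdict table (made up): one row per attempt, one column per judge.
toy_verdicts = [
    [True, True, True, False],
    [True, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
]

if __name__ == "__main__":
    print(f"toy ASR, 3-of-4 rule: {ensemble_asr(toy_verdicts, 3):.2f}")
    print(f"single-turn lift: {lift(SINGLE_TURN_ASR, BASELINE_ASR):.2f}x")
    print(f"SAGA-A4 lift: {lift(SAGA_A4_ASR, BASELINE_ASR):.2f}x")
```

On the reported numbers, the single-turn attack is roughly 2.63 times the baseline mean ASR and SAGA-A4 roughly 3.32 times. The toy table also shows why the aggregation rule matters: the same verdicts give an ASR of 0.25 under a 3-of-4 rule and 0.50 under a 2-of-4 rule.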