🤖 AI Summary
This work addresses the vulnerability of current large language models to prompt-rewriting attacks, such as role-playing and fictional framing, that circumvent safety alignment, together with the brittleness of existing defenses against adaptive black-box attackers. The authors propose a closed-loop adversarial framework in which a red team and a blue team co-evolve: the attacker uses template-free reinforcement learning to discover adversarial prompts that both evade safety filters and preserve the original intent, and in doing so uncovers attack primitives that transfer across attack families; the defender gains robustness through a two-stage training strategy combining GRPO, rejection-sampling-based supervised fine-tuning, and multi-objective reward optimization. Evaluated on BeaverTails and JailbreakBench, the method reduces the StrongREJECT score by 43.2% on average against five unseen attack types while maintaining a 0% false rejection rate on benign prompts, indicating that the hardened defense generalizes to attacks not seen during training.
📝 Abstract
Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses rely either on non-scalable human curation or on white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage pipeline of GRPO followed by rejection-sampled SFT, balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts the mean StrongREJECT score by 43.2% with a 0% false-refusal rate on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions covered thus far by adversarial training.
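The abstract does not give the exact form of the multiplicative reward or the advantage computation, so the following is only a minimal sketch under stated assumptions: both scores are taken to lie in [0, 1], the reward is assumed to be their plain product, and advantages are standardised within one sampled group, as is common in GRPO-style training. The function names, the example scores, and the use of the population standard deviation are hypothetical choices made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def multiplicative_reward(bypass: float, fidelity: float) -> float:
    """Assumed reward: product of two scores in [0, 1].

    A product (unlike a sum) is zero whenever either score is zero, so a
    rewrite cannot score well by satisfying only one of the two criteria.
    """
    if not (0.0 <= bypass <= 1.0 and 0.0 <= fidelity <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return bypass * fidelity


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt.

    Uses the group mean as the baseline and the population standard
    deviation as the scale (an assumption; implementations differ).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (bypass, fidelity) scores for four sampled rewrites.
scores = [(0.9, 0.9), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
rewards = [multiplicative_reward(b, f) for b, f in scores]
advantages = group_relative_advantages(rewards)
```

In this toy group, the second and third samples each maximise one criterion but receive zero reward, which is the behaviour the phrase "jointly enforces" suggests; only the first sample, which does well on both, ends up with a clearly positive advantage.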