CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

📅 2026-06-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two problems: current large language models remain vulnerable to prompt-rewriting attacks, such as role-playing and fictional framing, that circumvent safety alignment, and existing defenses are brittle against adaptive black-box attackers. The authors propose a closed-loop adversarial framework in which a red-team attacker and a blue-team defender co-evolve. The attacker uses template-free reinforcement learning to discover adversarial prompts that both evade safety filters and preserve the original intent, and in doing so uncovers attack primitives that transfer across attack families. The defender is hardened through a two-stage training strategy that combines GRPO, rejection-sampling-based supervised fine-tuning, and multi-objective reward optimization. Evaluated on BeaverTails and JailbreakBench, the method reduces the mean StrongREJECT score by 43.2% against five held-out attack families while maintaining a 0% false-refusal rate on benign prompts.
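The rejection-sampling step mentioned in the summary can be pictured with a minimal sketch. The listing does not say how candidates are generated, scored, or filtered, so everything here is an assumption made for illustration: the function name `rejection_sample`, the `generate` and `score` callables, the group size `k`, and the `threshold` are all hypothetical.

```python
from typing import Callable, Optional


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    k: int = 4,
    threshold: float = 0.5,
) -> Optional[str]:
    """Draw k candidate responses and keep the best one if it clears the threshold.

    Hypothetical helper: the paper's actual selection rule is not given here.
    """
    candidates = [generate(prompt) for _ in range(k)]
    best = max(candidates, key=lambda c: score(prompt, c))
    return best if score(prompt, best) >= threshold else None


# Stub generator and scorer, standing in for a model and a safety judge.
canned = iter([
    "Sure, here is how...",
    "I can't help with that.",
    "Maybe.",
    "I can't help with that, but here is some safety information.",
])


def stub_generate(prompt: str) -> str:
    return next(canned)


def stub_score(prompt: str, response: str) -> float:
    # Toy rule: treat a refusal prefix as the desired behaviour.
    return 1.0 if response.startswith("I can't") else 0.0


kept = rejection_sample("an adversarial rewrite", stub_generate, stub_score)
```

Responses that survive this filter would form the supervised fine-tuning set; prompts for which no candidate clears the threshold are simply dropped in this sketch.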
📝 Abstract
Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses rely either on non-scalable human curation or on white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts the mean StrongREJECT score by 43.2% with a 0% false-refusal rate on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
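To make the attacker's training signal concrete, here is a minimal sketch of a multiplicative reward feeding a group-relative advantage. The abstract does not give the reward formula, so the sketch assumes both scores lie in [0, 1] and are simply multiplied; the function names and the example scores are hypothetical. The normalisation follows the usual GRPO recipe (reward minus group mean, divided by group standard deviation); the population standard deviation is used here, and implementations differ on that detail.

```python
from statistics import mean, pstdev


def multiplicative_reward(bypass: float, fidelity: float) -> float:
    """Product of two scores in [0, 1]; an assumed form, not the paper's exact reward."""
    if not (0.0 <= bypass <= 1.0 and 0.0 <= fidelity <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return bypass * fidelity


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (bypass, fidelity) scores for four rewrites of one source prompt.
scored = [(0.9, 0.1), (0.5, 0.8), (0.0, 1.0), (0.8, 0.9)]
rewards = [multiplicative_reward(b, f) for b, f in scored]
advantages = group_relative_advantages(rewards)
```

The product form means a rewrite earns reward only when it scores well on both axes: the first sample bypasses the filter but drifts from the original intent, and the third preserves intent but is refused, so both end up with below-average advantage, while the fourth sample receives the largest.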
Problem

Research questions and friction points this paper is trying to address.

adversarial attacks
LLM safety
prompt rewriting
black-box adversaries
safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Red-Blue Teaming
Reinforcement Learning
LLM Safety
Group Relative Policy Optimization
Safety Alignment
Authors
Rahul Markasserithodi
University of New South Wales, Australia
Aditya Joshi
Senior Lecturer/Assistant Professor, UNSW
Natural Language Processing, AI for Social Good
Yuekang Li
Lecturer (Assistant Professor), University of New South Wales
Software Engineering, Software Security, AI Red Teaming
Ishmanbir Singh
University of New South Wales, Australia
Chris Yoo
University of New South Wales, Australia
Alan Niu
University of New South Wales, Australia