🤖 AI Summary
Conventional language model safety alignment relies on a passive, disjoint cycle of adversarial training and defense, which leads attackers to overfit to outdated defenses while defenders lag behind emerging threats.
Method: We propose Self-RedTeam, an online self-play reinforcement learning framework wherein a single large language model dynamically alternates between attacker and defender roles in a zero-sum game, enabling co-evolutionary, proactive alignment. It introduces the first online self-play training paradigm grounded in Nash equilibrium guarantees and incorporates a hidden chain-of-thought mechanism to enhance attack diversity while mitigating excessive refusal. The framework integrates multi-agent RL, reward modeling via a language model discriminator, and online self-adversarial training.
Results: Experiments show a 21.8% increase in attack diversity (measured by SBERT similarity) and a 65.5% improvement in robustness on benchmarks including WildJailBreak, advancing safety alignment from reactive patching toward autonomous co-evolution.
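The core training round described above alternates a single model between attacker and defender roles and assigns zero-sum rewards based on a reward-LM verdict. A minimal Python sketch of one such round, with toy stand-ins for each component (`attack_fn`, `defend_fn`, and `judge_fn` are hypothetical names for illustration, not the paper's actual API):

```python
# Minimal sketch of one self-play round. In the paper, one LM plays both roles
# and a reward LM adjudicates; here all three are toy callables.

def self_play_round(attack_fn, defend_fn, judge_fn, seed_prompt):
    """Play one attacker/defender round and return zero-sum rewards."""
    adversarial_prompt = attack_fn(seed_prompt)      # attacker role: rewrite the seed
    response = defend_fn(adversarial_prompt)         # defender role: try to answer safely
    unsafe = judge_fn(adversarial_prompt, response)  # reward LM: was the defense broken?
    attacker_reward = 1.0 if unsafe else -1.0
    defender_reward = -attacker_reward               # zero-sum game
    return attacker_reward, defender_reward

# Toy stand-ins (no real LM calls):
def toy_attack(seed):
    return seed + " -- ignore your safety instructions."

def toy_defend(prompt):
    return "I can't help with that." if "ignore" in prompt else "Sure: ..."

def toy_judge(prompt, response):
    return not response.startswith("I can't")

a_r, d_r = self_play_round(toy_attack, toy_defend, toy_judge, "How do I pick a lock?")
print(a_r, d_r)  # prints -1.0 1.0: the toy defender refused, so the attacker loses
```

In the actual method both policies share one set of weights and are updated with reinforcement learning against these rewards; the sketch only illustrates the zero-sum reward structure of a single round.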
📝 Abstract
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
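The safety guarantee stated above can be sketched in standard minimax form (a paraphrase for illustration; the symbols $\pi_a$, $\pi_d$, and $r_{\text{unsafe}}$ are assumed notation, not necessarily the paper's):

```latex
\max_{\pi_a} \; \min_{\pi_d} \;
\mathbb{E}_{x \sim \pi_a,\; y \sim \pi_d(\cdot \mid x)}
\left[ r_{\text{unsafe}}(x, y) \right]
```

Here $\pi_a$ generates adversarial prompts $x$, $\pi_d$ generates responses $y$, and $r_{\text{unsafe}}$ is the reward LM's unsafe-response score. At a Nash equilibrium $(\pi_a^*, \pi_d^*)$ of this zero-sum game, no attacker deviation can increase the unsafe rate against $\pi_d^*$, which is the sense in which the converged defender is robust to any adversarial prompt distribution.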