🤖 AI Summary
Traditional self-play methods often converge to local optima, hindering policy diversity and open-ended innovation. This paper proposes the Foundation Model-based Self-Play (FMSP) framework—comprising three variants: vanilla FMSP (vFMSP), Novelty-Search Self-Play (NSSP), and Quality-Diversity Self-Play (QDSP)—which pioneers the integration of large language models (LLMs) into self-play. By combining prompt-driven policy generation with reinforcement learning, tree search, and heuristic methods, FMSP enables competitive play, novelty-driven exploration, and quality-diversity co-evolution. Evaluated on a continuous-control task (Car Tag), FMSP outperforms handcrafted policies; in an AI safety benchmark (Gandalf jailbreak testing), it successfully breaches all six defense levels and autonomously generates mitigation patches, demonstrating the feasibility of automated red-teaming and defensive strengthening. The core contribution is enabling leapfrog optimization across the policy space, generating high-quality, diverse policies that transcend local optima.
📝 Abstract
Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum toward learning high-quality solutions. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across local optima in policy space. We propose a family of approaches: (1) **Vanilla Foundation-Model Self-Play (vFMSP)** continually refines agent policies via competitive self-play; (2) **Novelty-Search Self-Play (NSSP)** builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, **Quality-Diversity Self-Play (QDSP)**, creates a diverse set of high-quality policies by combining the diversity of NSSP and the refinement of vFMSP. We evaluate FMSPs in Car Tag, a continuous-control pursuer-evader setting, and in Gandalf, a simple AI safety simulation in which an attacker tries to jailbreak an LLM's defenses. In Car Tag, FMSPs explore a wide variety of reinforcement learning, tree search, and heuristic-based methods, to name just a few. In terms of discovered policy quality, QDSP and vFMSP surpass strong human-designed strategies. In Gandalf, FMSPs can automatically red-team an LLM, jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs can automatically proceed to patch the discovered vulnerabilities. Overall, FMSPs represent a promising new research frontier of improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.
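The QDSP variant described above combines NSSP's diversity pressure with vFMSP's competitive refinement. The paper's actual implementation has the foundation model write policy *code*; the sketch below is only a toy numeric illustration of the archive logic, with hypothetical names (`propose_policy`, `qdsp_step`) and random stand-ins for the FM's proposals and the evaluation step:

```python
import math
import random

def propose_policy(rng):
    """Stand-in for a foundation-model call. In FMSP the FM generates
    executable policy code; here we just sample a 2-D behavior descriptor
    and a quality score in [0, 1] (both hypothetical placeholders)."""
    behavior = (rng.random(), rng.random())
    quality = rng.random()
    return behavior, quality

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def qdsp_step(archive, rng, novelty_threshold=0.15):
    """One QDSP-style iteration: a novel candidate opens a new niche
    (NSSP-style); otherwise it competes with its nearest archived
    policy and replaces it only if stronger (vFMSP-style refinement)."""
    behavior, quality = propose_policy(rng)
    if not archive or min(dist(behavior, b) for b, _ in archive) > novelty_threshold:
        archive.append((behavior, quality))  # new strategy niche
    else:
        i = min(range(len(archive)), key=lambda j: dist(behavior, archive[j][0]))
        if quality > archive[i][1]:
            archive[i] = (behavior, quality)  # refine the existing niche

rng = random.Random(0)
archive = []
for _ in range(200):
    qdsp_step(archive, rng)
print(len(archive))  # number of distinct strategy niches discovered
```

The key design point this illustrates is that quality and diversity are pursued jointly: the novelty test prevents the archive from collapsing into a single local optimum, while the replacement rule keeps each niche improving over time.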