🤖 AI Summary
Traditional self-play methods often converge to local optima, hindering policy diversity and open-ended innovation. This paper proposes the Foundation Model-based Self-Play (FMSP) framework—comprising three variants: vanilla FMSP (vFMSP), Novelty-Search Self-Play (NSSP), and Quality-Diversity Self-Play (QDSP)—which pioneers the integration of large language models (LLMs) into self-play. By combining prompt-driven policy generation with reinforcement learning, tree search, and heuristic methods, FMSP enables competitive play, novelty-driven exploration, and quality-diversity co-evolution. Evaluated on a continuous-control task (Car Tag), FMSP outperforms handcrafted policies; in an AI safety benchmark (Gandalf jailbreak testing), it successfully breaches all six defense levels and autonomously generates mitigation patches, demonstrating the feasibility of automated red-teaming and defensive strengthening. The core contribution is enabling leapfrog optimization across the policy space, generating high-quality, diverse policies that transcend local optima.
📝 Abstract
Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum toward learning high-quality solutions. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across local optima in policy space. We propose a family of approaches: (1) **Vanilla Foundation-Model Self-Play (vFMSP)** continually refines agent policies via competitive self-play; (2) **Novelty-Search Self-Play (NSSP)** builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, **Quality-Diversity Self-Play (QDSP)**, creates a diverse set of high-quality policies by combining the diversity of NSSP and the refinement of vFMSP. We evaluate FMSPs in Car Tag, a continuous-control pursuer-evader setting, and in Gandalf, a simple AI safety simulation in which an attacker tries to jailbreak an LLM's defenses. In Car Tag, FMSPs explore a wide variety of reinforcement learning, tree search, and heuristic-based methods, to name just a few. In terms of discovered policy quality, QDSP and vFMSP surpass strong human-designed strategies. In Gandalf, FMSPs can automatically red-team an LLM, jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs can automatically proceed to patch the discovered vulnerabilities. Overall, FMSPs represent a promising new research frontier of improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.
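The QDSP variant described above combines NSSP's diversity pressure with vFMSP's competitive refinement. The paper's actual implementation has the foundation model write policy *code*; the sketch below is only a toy numeric illustration of the archive logic, with hypothetical names (`propose_policy`, `qdsp_step`) and random stand-ins for the FM's proposals and the evaluation step:

```python
import math
import random

def propose_policy(rng):
    """Stand-in for a foundation-model call. In FMSP the FM generates
    executable policy code; here we just sample a 2-D behavior descriptor
    and a quality score in [0, 1] (both hypothetical placeholders)."""
    behavior = (rng.random(), rng.random())
    quality = rng.random()
    return behavior, quality

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def qdsp_step(archive, rng, novelty_threshold=0.15):
    """One QDSP-style iteration: a novel candidate opens a new niche
    (NSSP-style); otherwise it competes with its nearest archived
    policy and replaces it only if stronger (vFMSP-style refinement)."""
    behavior, quality = propose_policy(rng)
    if not archive or min(dist(behavior, b) for b, _ in archive) > novelty_threshold:
        archive.append((behavior, quality))  # new strategy niche
    else:
        i = min(range(len(archive)), key=lambda j: dist(behavior, archive[j][0]))
        if quality > archive[i][1]:
            archive[i] = (behavior, quality)  # refine the existing niche

rng = random.Random(0)
archive = []
for _ in range(200):
    qdsp_step(archive, rng)
print(len(archive))  # number of distinct strategy niches discovered
```

The key design point this illustrates is that quality and diversity are pursued jointly: the novelty test prevents the archive from collapsing into a single local optimum, while the replacement rule keeps each niche improving over time.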