🤖 AI Summary
This work addresses strategic safety issues in multi-agent language model interactions, such as exploitation of weaker agents, harmful collusion, and externalized costs, by proposing Safe Equilibrium Policy Optimization (SEPO). SEPO is the first method to embed explicit penalties for exploitability, collusion risk, and externality cost directly into the reinforcement learning objective. Implemented as a reward signal for the GRPO algorithm, it trains language agents while correcting the over-cooperative behavior introduced by supervised fine-tuning; computing the exploit penalty per rollout keeps its gradient signal from cancelling in GRPO's advantage normalization. Experiments on Gemma 4 E4B-it and Qwen 3.5-4B show that SEPO achieves zero exploit-pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four of the five strategic domains, and, in negotiation, yields a positive-safety outcome together with the only positive normalized relative advantage of any negotiation configuration.
📝 Abstract
Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions through free-form generation, strategic failure modes -- exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs -- are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (SEPO), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement SEPO as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma 4 E4B-it and Qwen 3.5-4B after supervised fine-tuning (SFT). We evaluate across five strategic domains: Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker. SEPO achieves zero exploit-pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over-cooperative behavior introduced by SFT. In negotiation, SEPO achieves a positive-safety outcome and the only positive normalized relative advantage of any negotiation configuration. Ablation experiments confirm that per-rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control-variate property), so it contributes zero gradient. To support further research in strategic safety for agents, we release our code (https://anonymous.4open.science/r/sepo-2668/README.md) and SFT datasets.
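The cancellation behind the ablation result can be illustrated with a small sketch. It is not the released implementation: the function names, the penalty weights, the toy payoffs, and the use of population standard deviation for group normalization are all assumptions made here for illustration. It subtracts a penalty from each rollout's payoff and then computes group-relative advantages as (reward minus group mean) divided by group standard deviation.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center on the group mean, scale by group std.
    (Assumed normalization; the paper's exact variant may differ.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def shaped_reward(payoff, exploit, collusion=0.0, externality=0.0,
                  weights=(1.0, 1.0, 1.0)):
    """Hypothetical SEPO-style reward: payoff minus weighted safety penalties.
    The weights are placeholders, not values from the paper."""
    w_exploit, w_collusion, w_externality = weights
    return (payoff
            - w_exploit * exploit
            - w_collusion * collusion
            - w_externality * externality)

# Toy task payoffs for one group of four rollouts (made-up numbers).
payoffs = [1.0, 0.5, 2.0, 1.5]

# Case 1: every rollout in the group receives the same exploit penalty.
shared = [shaped_reward(p, exploit=0.7) for p in payoffs]

# Case 2: the exploit penalty is computed separately for each rollout.
per_rollout_exploit = [0.0, 0.1, 1.5, 0.4]
per_rollout = [shaped_reward(p, exploit=e)
               for p, e in zip(payoffs, per_rollout_exploit)]

adv_plain = grpo_advantages(payoffs)
adv_shared = grpo_advantages(shared)
adv_per_rollout = grpo_advantages(per_rollout)

# The shared constant shifts the group mean by the same amount as every
# reward, so it vanishes after centering: advantages match the unpenalized case.
shared_cancels = all(abs(a - b) < 1e-9 for a, b in zip(adv_shared, adv_plain))

# The per-rollout penalty changes the ranking inside the group, so it
# survives normalization and can influence the policy gradient.
per_rollout_differs = any(abs(a - b) > 1e-6
                          for a, b in zip(adv_per_rollout, adv_plain))
```

In this toy group, the third rollout has the highest raw payoff but also the largest per-rollout exploit penalty, so its advantage turns negative only in the per-rollout case; under the shared constant its advantage is unchanged from the unpenalized baseline.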