AI Summary
This work addresses the safety risks and power-seeking behaviors that arise when large language model (LLM) agents operate in complex environments via the Model Context Protocol (MCP), where an expanded action space can amplify minor errors into catastrophic outcomes. To mitigate these issues, the authors propose SafeMCP, a server-side defense plugin that introduces a novel three-stage training pipeline integrating environment dynamics modeling with verifiable rewards. Leveraging an internal world model for forward-looking reasoning, SafeMCP grounds agent behavior in environmental dynamics, initializes safe policies, and employs dual-reward reinforcement learning. The approach incorporates a dual mechanism of proactive tool filtering and real-time intervention, significantly reducing risk on benchmarks such as PowerSeeking Bench, ToolEmu, and AgentHarm while preserving task performance, thus balancing safety and utility.
Abstract
As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces gives agents unsafe capabilities and heightens the risk of power-seeking. While a broad action space and greater influence over the environment are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning about future safety risks. SafeMCP uses an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion, and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.
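The two-tier defense described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Tool`, `toy_risk`, `filter_tools`, and `allow_call`, the keyword-based scorer, and the 0.5 threshold are all hypothetical stand-ins. In particular, `toy_risk` only takes the place of the learned world model, and no real MCP SDK calls are used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical tool record; not taken from the paper or from any MCP SDK.
@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Assumed interface for the risk predictor: (task, tool, arguments) -> score in [0, 1].
RiskModel = Callable[[str, Tool, Dict[str, str]], float]


def toy_risk(task: str, tool: Tool, args: Dict[str, str]) -> float:
    """Keyword-based placeholder for a learned world model (illustration only)."""
    text = " ".join([tool.name, tool.description, *args.values()]).lower()
    hits = sum(word in text for word in ("delete", "sudo", "transfer"))
    return min(1.0, 0.6 * hits)


def filter_tools(task: str, tools: List[Tool], risk: RiskModel,
                 threshold: float = 0.5) -> List[Tool]:
    """Tier 1 (proactive): expose only tools whose predicted risk is below the threshold."""
    return [t for t in tools if risk(task, t, {}) < threshold]


def allow_call(task: str, tool: Tool, args: Dict[str, str], exposed: List[Tool],
               risk: RiskModel, threshold: float = 0.5) -> bool:
    """Tier 2 (fail-safe): re-check a concrete call, including its arguments."""
    if tool not in exposed:
        return False  # the tool was never offered to the agent
    return risk(task, tool, args) < threshold


# Hypothetical tool catalogue and task used for the demonstration.
read_file = Tool("read_file", "Read a file from disk")
run_shell = Tool("run_shell", "Run a shell command")
delete_all = Tool("delete_all", "Delete every file in a directory")

task = "Summarize notes.txt"
exposed = filter_tools(task, [read_file, run_shell, delete_all], toy_risk)
```

In this toy setup, `delete_all` is removed before the agent ever sees it, while `run_shell` is exposed but a call such as `{"cmd": "sudo rm -rf /"}` is still blocked at call time. That is the role the abstract assigns to immediate intervention: catching hazards that filtering on tool descriptions alone cannot anticipate.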