SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

2026-06-01

AI Summary
This work addresses the safety risks and power-seeking behavior that arise when large language model (LLM) agents operate in complex environments through the Model Context Protocol (MCP), where an expanded action space can amplify minor errors or hallucinations into catastrophic failures. The authors propose SafeMCP, a server-side defense plugin that uses an internal world model for look-ahead reasoning about future safety risks. It applies a two-tier defense: proactive tool filtering to limit hazardous power expansion, and immediate intervention as a fail-safe. SafeMCP is trained with a three-stage pipeline: environmental dynamics grounding, safe policy initialization, and reinforcement learning with dual verifiable rewards. On PowerSeeking Bench, ToolEmu, and AgentHarm, the authors report that SafeMCP mitigates risk while preserving agent utility.
Abstract
As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces gives agents unsafe capabilities and heightens the risk of power-seeking. While a broad action space and greater influence over the environment are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning about future safety risks. SafeMCP uses an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion, and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamics grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.
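The two-tier defense described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Tool`, `filter_tools`, `guard_call`, and `toy_risk`, the risk threshold, and the hand-written risk scores are all hypothetical, and the lookup table merely stands in for the learned world model, whose interface the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from the paper.
@dataclass(frozen=True)
class Tool:
    name: str
    description: str

# A risk model maps (task, tool name, call arguments) to a predicted risk in [0, 1].
# In SafeMCP this role is played by look-ahead reasoning with a world model;
# here it is an arbitrary callable supplied by the caller.
RiskModel = Callable[[str, str, Dict[str, str]], float]


def filter_tools(task: str, tools: List[Tool], predict_risk: RiskModel,
                 threshold: float = 0.5) -> List[Tool]:
    """Tier 1 (proactive): expose only tools whose predicted risk is below the threshold."""
    return [t for t in tools if predict_risk(task, t.name, {}) < threshold]


def guard_call(task: str, tool_name: str, args: Dict[str, str],
               allowed: List[Tool], predict_risk: RiskModel,
               threshold: float = 0.5) -> str:
    """Tier 2 (fail-safe): re-check a concrete call, including its arguments."""
    if tool_name not in {t.name for t in allowed}:
        return "block"  # the tool was never exposed to the agent
    if predict_risk(task, tool_name, args) >= threshold:
        return "block"  # the concrete call looks risky even though the tool passed tier 1
    return "allow"


def toy_risk(task: str, tool_name: str, args: Dict[str, str]) -> float:
    """Toy stand-in for the world model: fixed scores plus one argument-based rule."""
    base = {"read_file": 0.1, "delete_file": 0.4, "grant_admin": 0.9}.get(tool_name, 0.5)
    if args.get("path", "").startswith("/etc"):
        base = max(base, 0.8)  # assumed rule: touching system paths is high risk
    return base


tools = [
    Tool("read_file", "Read a file"),
    Tool("delete_file", "Delete a file"),
    Tool("grant_admin", "Grant admin rights"),
]
allowed = filter_tools("clean up temp files", tools, toy_risk)
allowed_names = [t.name for t in allowed]
```

In this sketch, tier 1 removes `grant_admin` before the agent ever sees it, while tier 2 can still block a call to an exposed tool such as `delete_file` when its arguments raise the predicted risk.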
Problem

Research questions and friction points this paper is trying to address.

power-seeking
LLM agents
action space
safety risk
Model Context Protocol
Innovation

Methods, ideas, or system contributions that make the work stand out.

look-ahead reasoning
proactive power regulation
Model Context Protocol
tool filtering
reinforcement learning
Authors

Lichao Wang
Beijing Institute of Technology
Zhaoxing Ren
Beijing Academy of Artificial Intelligence
Tianzhuo Yang
Institute for Artificial Intelligence, Peking University
Jiaming Ji
Institute for Artificial Intelligence, Peking University
Chi Harold Liu
Professor, Vice Dean, Fellow of IET and BCS, Beijing Institute of Technology
IoT, Mobile Crowd Sensing, UAV Crowdsensing, Embodied AI, Deep Reinforcement Learning
Yaodong Yang
Beijing Academy of Artificial Intelligence
Juntao Dai
Beijing Academy of Artificial Intelligence