Agent Safety Alignment via Reinforcement Learning

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autonomous LLM agents face dual-channel security threats during external tool invocation—adversarial user prompts and malicious tool outputs—yet existing defenses lack a unified safety alignment framework. This paper proposes the first safety-aligned framework for tool-calling agents, integrating structured reasoning with sandboxed reinforcement learning to jointly optimize security and task utility. Key contributions include: (1) a tri-modal safety taxonomy distinguishing benign, malicious, and sensitive tool interactions; (2) a policy-driven decision model that dynamically governs tool invocation and response generation; and (3) the first real-tool-execution sandbox enabling fine-grained, reward-shaped safety training. Evaluated on multiple public and custom benchmarks, our approach achieves significant improvements in robustness against adversarial inputs and malicious tool behaviors while preserving original task performance across diverse domains.
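The paper does not include pseudocode, but the tri-modal taxonomy combined with a policy-driven decision model can be sketched as a small lookup over the (user prompt, tool response) label pair. This is a minimal illustrative sketch, not the paper's implementation; the label names follow the taxonomy, while the action names (`proceed`, `ask_confirmation`, `refuse`) and the policy table itself are assumptions.

```python
from enum import Enum

class Label(Enum):
    """Tri-modal safety labels applied to both channels."""
    BENIGN = "benign"
    MALICIOUS = "malicious"
    SENSITIVE = "sensitive"

def decide_action(user_label: Label, tool_label: Label) -> str:
    """Map a (user prompt, tool response) label pair to an agent action.

    Illustrative policy: a malicious signal on either channel is refused,
    a sensitive signal triggers a confirmation step, and only a fully
    benign pair lets the agent invoke the tool and respond directly.
    """
    if Label.MALICIOUS in (user_label, tool_label):
        return "refuse"
    if Label.SENSITIVE in (user_label, tool_label):
        return "ask_confirmation"
    return "proceed"
```

A classifier (or the agent's own structured reasoning step) would produce the two labels; the point of the table is that tool outputs are gated by the same taxonomy as user prompts, so a compromised tool cannot bypass the safety policy.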

📝 Abstract
The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety-alignment framework for tool-using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri-modal taxonomy (benign, malicious, and sensitive) covering both user prompts and tool responses, and define a policy-driven decision model. Our framework employs a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping. Through extensive evaluations on public and self-built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety-aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.
Problem

Research questions and friction points this paper is trying to address.

Addressing safety risks in autonomous LLM agents using tools
Handling user and tool-initiated threats via structured reasoning
Optimizing safety and effectiveness for trustworthy agent deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified safety-alignment framework for tool-using agents
Tri-modal taxonomy for threat classification
Sandboxed reinforcement learning with fine-grained rewards
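The fine-grained reward shaping mentioned above can be illustrated as a reward that mixes a utility term (did the benign task succeed?) with a safety term (did the chosen action match the true threat label?). This is a hedged sketch under assumed weights and action names; the paper's actual reward design and coefficients are not specified here.

```python
def shaped_reward(task_success: bool, action: str, true_threat: str) -> float:
    """Illustrative shaped reward for sandboxed RL training.

    Combines a utility term with a safety term so that safety and task
    performance are jointly optimized. The 0.5/0.5 weights and the
    per-case penalties are assumptions, not the paper's values.
    """
    utility = 1.0 if task_success else 0.0
    if true_threat == "malicious":
        # Refusing a malicious interaction is rewarded; complying is penalized.
        safety = 1.0 if action == "refuse" else -1.0
    elif true_threat == "sensitive":
        # Sensitive cases should route through user confirmation.
        safety = 1.0 if action == "ask_confirmation" else -0.5
    else:
        # Benign cases: over-refusal is mildly penalized to preserve utility.
        safety = 1.0 if action == "proceed" else -0.2
    return 0.5 * utility + 0.5 * safety
```

Training in a sandbox that actually executes tools lets this reward be computed per interaction, so the policy learns to refuse malicious tool outputs without collapsing into refusing benign requests.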
Zeyang Sha, Ant Group
Hanling Tian, Ant Group
Zhuoer Xu, Ant Group
Shiwen Cui, Ant Group
Changhua Meng, Ant Group
Weiqiang Wang, Ant Group