AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses emerging safety risks in open-world AI agents such as OpenClaw, whose cross-environment execution capabilities open broad new sources of risk. At the same time, frontier models lower the barrier to attack, leaving existing agent alignment frameworks inadequate for real-world deployment. To tackle this challenge, we propose AgentDoG 1.5, a lightweight and scalable safety alignment framework. Our approach first updates the agent safety taxonomy to cover emergent risks from Codex and OpenClaw execution scenarios, then builds a taxonomy-guided data engine with influence-function purification, which allows AgentDoG 1.5 variants of 0.8B, 2B, 4B, and 8B parameters to be trained on only about 1,000 samples. Building on these models, we construct an agentic safety SFT and RL training environment that reduces deployment overhead in Docker-level environments by two orders of magnitude, and we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Empirical results indicate performance comparable to leading closed-source models such as GPT-5.4 in complex interactive agentic scenarios. All models and datasets are publicly released.

📝 Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new sources of safety risk. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.
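
The abstract does not specify how the influence-function purification step is computed, so the following is only a minimal sketch of the general idea, not the authors' data engine. It assumes a toy logistic-regression model and a first-order influence approximation (the dot product of a training sample's gradient with the mean validation gradient, i.e. the inverse Hessian is replaced by the identity). All names here (`grad_logloss`, `influence_scores`, `purify`) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_logloss(w, x, y):
    """Gradient of the logistic loss for one sample (x, y) at weights w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def influence_scores(w, train, val):
    """First-order influence of each training sample on the validation loss.

    Assumption: identity Hessian, so the score is simply
    dot(grad of training sample, mean validation gradient).
    A positive score means a gradient step on that sample is
    expected to lower the validation loss.
    """
    dim = len(w)
    g_val = [0.0] * dim
    for x, y in val:
        g = grad_logloss(w, x, y)
        for j in range(dim):
            g_val[j] += g[j] / len(val)
    return [
        sum(a * b for a, b in zip(grad_logloss(w, x, y), g_val))
        for x, y in train
    ]


def purify(train, scores, keep):
    """Keep the `keep` highest-scoring samples, preserving original order."""
    ranked = sorted(range(len(train)), key=lambda i: scores[i], reverse=True)
    return [train[i] for i in sorted(ranked[:keep])]


# Toy data: the second training sample contradicts the validation label.
train = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 0)]
val = [([1.0, 0.0], 1)]
scores = influence_scores([0.0, 0.0], train, val)
clean = purify(train, scores, keep=2)
```

In this toy run the contradictory sample receives a negative score and is the one dropped. A full influence-function treatment would also involve an inverse-Hessian-vector product, which this sketch deliberately omits.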

Problem

Research questions and friction points this paper is trying to address.

AI Agent Safety

Security Risks

Agent Alignment

Open-World Agents

Attack Barriers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Alignment

Taxonomy-Guided Data Engine

Influence-Function Purification