🤖 AI Summary
This work addresses emerging safety risks in open-world AI agents such as OpenClaw, whose cross-environment execution capabilities open new sources of risk, while frontier models lower the barrier to attack and leave existing agent alignment frameworks inadequate for real-world deployment. To tackle this challenge, we propose AgentDoG 1.5, a lightweight and scalable safety alignment framework. Our approach first updates the agent safety taxonomy to cover risks that arise in Codex and OpenClaw execution scenarios, then introduces a taxonomy-guided data engine that uses influence functions for data purification, which allows models from 0.8B to 8B parameters to be trained on only about 1,000 samples. Building on AgentDoG 1.5, we construct an agentic safety SFT and RL training environment that reduces deployment overhead in Docker-level environments by two orders of magnitude, and we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Reported results show performance comparable to leading closed-source models such as GPT-5.4 in complex interactive agentic scenarios. All models and datasets are openly released.
📝 Abstract
Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they also introduce a broad range of new safety risks. Meanwhile, frontier AI models drastically lower the barrier to attack, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.
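The "training-free online guardrail" idea can be pictured as a screening step that sits between the agent and its executor: every proposed action is classified before it runs, and the agent itself is not retrained. The sketch below is only an illustration under stated assumptions. The names `Verdict`, `toy_guard`, `guarded_run`, and `RISK_PATTERNS`, the two risk categories, and the keyword rules are all hypothetical and are not taken from the paper; in particular, `toy_guard` is a keyword stand-in for where a fine-tuned guard model such as AgentDoG 1.5 would presumably be called.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    safe: bool
    category: str  # hypothetical risk label, not the paper's taxonomy


# Hypothetical keyword rules standing in for a learned guard model.
RISK_PATTERNS = {
    "destructive_command": ("rm -rf", "mkfs", "dd if="),
    "credential_exfiltration": ("id_rsa", ".aws/credentials"),
}


def toy_guard(action: str) -> Verdict:
    """Classify one proposed action (assumption: actions are plain strings)."""
    lowered = action.lower()
    for category, patterns in RISK_PATTERNS.items():
        if any(pattern in lowered for pattern in patterns):
            return Verdict(False, category)
    return Verdict(True, "none")


def guarded_run(
    actions: List[str],
    guard: Callable[[str], Verdict],
    execute: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Screen each action before execution; unsafe actions are never executed."""
    log = []
    for action in actions:
        verdict = guard(action)
        if verdict.safe:
            log.append(("executed", execute(action)))
        else:
            log.append(("blocked", verdict.category))
    return log


if __name__ == "__main__":
    proposed = ["ls -la", "rm -rf /tmp/work", "cat ~/.ssh/id_rsa"]
    # The executor here only echoes the command; a real one would run it.
    for status, detail in guarded_run(proposed, toy_guard, lambda a: f"ran: {a}"):
        print(status, detail)
```

Because the guard is passed in as a plain callable, swapping the keyword rules for a model-backed classifier would not change `guarded_run`, which is the sense in which such a guardrail needs no training of the agent it protects.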