BraveGuard: From Open-World Threats to Safer Computer-Use Agents

📅 2026-05-31
📈 Citations: 0
Influential: 0
📄 PDF

career value

203K/year
🤖 AI Summary
Existing security mechanisms struggle to detect subtle risks emerging from multi-step executions by AI agents. This work proposes BraveGuard, a self-evolving defense framework that constructs trajectory-level supervision signals by mining real-world threat indicators and agent execution traces in open environments, thereby training generalizable protection models. Moving beyond the limitations of static benchmarks and synthetic data, BraveGuard establishes an adaptive defense loop encompassing threat discovery, task instantiation, trajectory collection, and model training, and is compatible with mainstream guardrail backbones such as Qwen3-Guard and Llama-Guard. Evaluated on the AgentHazard benchmark, the approach significantly improves detection accuracy from 38.79% to 82.38%, substantially enhancing the identification of multi-step malicious behaviors.
📝 Abstract
Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.
Problem

Research questions and friction points this paper is trying to address.

computer-use agents
safety risks
multi-step execution
trajectory-level harm
open-world threats
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving defense
trajectory-level supervision
open-world threat discovery
computer-use agents
adaptive guard training
🔎 Similar Papers
Y
Yunhao Feng
Fudan University, Ant Group
Y
Yifan Ding
Fudan University
X
Xiaohu Du
Ant Group
M
Ming Wen
Fudan University, Shanghai Innovation Institute
X
Xinhao Deng
Ant Group
Yanming Guo
Yanming Guo
National University of Defense Technology
deep learningcomputer vision
Y
Yuxiang Xie
Hunan Institute of Advanced Technology
B
Baihui Zheng
Alibaba Group
Y
Yingshui Tan
Alibaba Group
Yige Li
Yige Li
Singapore Management University
Trustworthy Machine Learning
Y
Yutao Wu
Deakin University
Y
Yixu Wang
Fudan University
K
Kerui Cao
Alibaba Group
Wenke Huang
Wenke Huang
School of Computer Science, Wuhan University
Federated LearningMLLM
Xingjun Ma
Xingjun Ma
Fudan University
Trustworthy AIMultimodal AIGenerative AIEmbodied AI
Yu-Gang Jiang
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video AnalysisEmbodied AITrustworthy AI