Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Computer-Using Agents (CUAs) pose emerging security risks when abused for real-world attacks on operating systems, yet existing research suffers from four key limitations: inadequate modeling of attacker knowledge, incomplete coverage of the cyber kill chain, unrealistic execution environments, and unreliable evaluation methodologies. Method: To address these gaps, we propose AdvCUA—the first CUA benchmark explicitly aligned with MITRE ATT&CK tactics—featuring 140 diverse tasks, support for multi-host networks and encrypted user credentials, and hard-coded automated evaluation to eliminate LLM-based judge bias. Contribution/Results: Experimental evaluation reveals that while mainstream CUAs do not yet fully replicate advanced adversary capabilities, they substantially lower the barrier to enterprise-scale intrusion, enabling low-skill adversaries to execute end-to-end attacks. This underscores critical security and ethical concerns regarding CUA deployment and highlights the urgent need for robust safeguards and standardized evaluation frameworks.

📝 Abstract
Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments without multi-host networks and encrypted user credentials; and unreliable judgment dependent on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks, including 40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains, and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host environment sandbox using hard-coded evaluation. We evaluate five mainstream CUAs (ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE) built on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, raising societal concerns about the responsibility and security of CUAs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating security threats of computer-use agents in operating systems
Benchmarking real-world attack capabilities using MITRE ATT&CK framework
Assessing how CUAs enable complex intrusions without expert knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdvCUA benchmark aligns with MITRE ATT&CK TTPs
Systematic evaluation uses multi-host sandbox environment
Hard-coded assessment replaces unreliable LLM-as-a-Judge
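To make the last point concrete, a "hard-coded" evaluator scores a task by deterministically inspecting the post-execution environment state rather than asking an LLM judge to grade a transcript. The sketch below is purely illustrative and assumes nothing about AdvCUA's actual checks: the task names, the cron-based persistence check, and the marker values are all invented for this example.

```python
import re

# Hypothetical deterministic checks for two TTP-style tasks. Each check reads
# captured environment state (here, passed in as text) and returns pass/fail.
# The same state always yields the same verdict, which is the property that
# removes LLM-as-a-Judge variance and bias.

def check_persistence(crontab_text: str) -> bool:
    """Pass iff a reboot-triggered cron entry was installed (illustrative marker)."""
    return any(
        line.strip().startswith("@reboot")
        for line in crontab_text.splitlines()
    )

def check_account_creation(passwd_text: str) -> bool:
    """Pass iff a new UID-0 account other than root appears in /etc/passwd-style text."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == "0" and fields[0] != "root":
            return True
    return False

# Deterministic verdicts on sample state:
print(check_persistence("@reboot /opt/agent/task.sh\n"))          # entry present
print(check_persistence("0 5 * * * /usr/bin/backup\n"))           # no reboot entry
print(check_account_creation("root:x:0:0::/root:/bin/bash\nsvc:x:0:0::/:/bin/sh\n"))
```

Because the verdict is a pure function of environment state, re-running the evaluator on the same sandbox snapshot can never flip a score, unlike a sampled LLM judgment.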
Weidi Luo
University of Georgia
Qiming Zhang
University of Wisconsin–Madison
Tianyu Lu
University of Wisconsin–Madison
Artificial Intelligence, Computational Biology
Xiaogeng Liu
Johns Hopkins University
Trustworthy AI
Bin Hu
University of Maryland, College Park
Hung-Chun Chiu
Hong Kong University of Science and Technology
Siyuan Ma
Chinese University of Hong Kong
Yizhe Zhang
Apple
Xusheng Xiao
Arizona State University
Yinzhi Cao
Johns Hopkins University
Computer Security
Zhen Xiang
University of Georgia
Machine Learning
Chaowei Xiao
University of Wisconsin–Madison / NVIDIA
Trustworthy Machine Learning, Adversarial Machine Learning, AI Safety, Robust AI, Security