sudo rm -rf agentic_security

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a critical security gap: large language models (LLMs) deployed as desktop/web computer-use agents can be induced to bypass safety guardrails trained via refusal learning, revealing a deficiency in the context-aware defenses of existing commercial agents (e.g., Claude Computer Use). The authors propose SUDO, an iterative, screen-aware attack framework that combines vision-language model (VLM)-driven instruction parsing, a detoxification-then-toxification (Detox2Tox) transformation, real-time screen-content understanding, and refusal-feedback-guided re-injection. Evaluated on 50 realistic desktop tasks, SUDO raises the attack success rate on Claude Computer Use from 24% (no refinement) to 41% (with iterative refinement). This demonstrates that computer-use agents require context-sensitive, dynamically adaptive safeguards, and highlights an urgent need for operation-aware safety frameworks.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24% (with no refinement), and up to 41% (via its iterative refinement) in Claude Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.
Problem

Research questions and friction points this paper is trying to address.

Bypassing refusal safeguards in LLM computer-use agents
Transforming harmful requests into benign ones for execution
Iteratively refining attacks against robust policy filters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detox2Tox transforms harmful requests into seemingly benign ones
SUDO iteratively refines attacks using refusal feedback
Utilizes advanced vision language models for instructions
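The Detox2Tox loop described above can be sketched in a few lines. This is a hypothetical toy model, not the paper's implementation: `detoxify`, `get_instructions`, `toxify`, and `MockAgent` are all stubs standing in for the VLM-driven rewriting and the live computer-use agent, and the refusal-feedback "refinement" is reduced to a crude string obfuscation purely to show the retry structure.

```python
# Toy sketch of the Detox2Tox attack loop (all components are stubs,
# not the real SUDO framework or any real agent API).

def detoxify(request: str) -> str:
    """Rewrite a harmful request as a seemingly benign one (stubbed)."""
    return request.replace("wipe the disk", "inspect the disk")

def get_instructions(benign_request: str) -> list[str]:
    """Obtain step-by-step instructions for the benign request
    (in SUDO this comes from an advanced VLM; stubbed here)."""
    return ["open terminal", f"task: {benign_request}"]

def toxify(steps: list[str], payload: str) -> list[str]:
    """Reintroduce the malicious payload just before execution (stubbed)."""
    return steps[:-1] + [f"task: {payload}"]

class MockAgent:
    """Toy target that refuses any plan containing a blocklisted phrase."""
    BLOCKLIST = {"wipe the disk"}

    def run(self, steps: list[str]) -> str:
        if any(bad in step for bad in self.BLOCKLIST for step in steps):
            return "refused"
        return "ok"

def sudo_attack(payload: str, agent: MockAgent, max_iters: int = 3):
    """Iterate detoxify -> instruct -> toxify -> execute, refining the
    payload on each refusal (refusal-feedback loop)."""
    for attempt in range(1, max_iters + 1):
        plan = toxify(get_instructions(detoxify(payload)), payload)
        if agent.run(plan) == "ok":
            return plan, attempt
        # Refusal feedback: obfuscate the payload and retry
        # (a crude stand-in for SUDO's iterative refinement).
        payload = payload.replace(" ", "_")
    return None, max_iters
```

Running `sudo_attack("wipe the disk", MockAgent())` is refused on the first attempt (the raw payload trips the filter) and succeeds on the second once the payload is obfuscated, illustrating why static string-level filters fail against feedback-driven rewriting.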