🤖 AI Summary
Although safety alignment constrains large language model agents, they can still be induced to perform malicious actions, and existing jailbreaking methods rarely do so consistently. This work proposes TRACE, an adaptive, task-aware jailbreaking framework for agents. TRACE decomposes a target task into multiple candidate subtask sequences, selects the sequence with the fewest explicitly harmful subtasks, and disguises the remaining harmful subtasks as benign-looking instructions. It embeds those subtasks in scenarios built from related roles, environments, directives, and heuristics, and iteratively evolves the scenarios through transformation actions sampled by a Q-learning-inspired mechanism. On AgentHarm and AdvCUA, TRACE outperforms existing jailbreak baselines across multiple LLM agents, reaching up to a 100% bypass rate and an average success score of 0.73, and it is also shown to be effective in controlled cyberattack instances.
📝 Abstract
The rise of LLM agents introduces a new threat by enabling planning, coding, and even end-to-end execution of expert-level attack workflows. However, this threat remains underexplored and underestimated because (i) safety alignment prevents LLMs from directly generating harmful instructions, and (ii) most existing jailbreak methods cannot consistently induce agents to execute malicious operations. In this paper, we propose TRACE, a practical agentic jailbreaking framework, to further reveal the risks of this threat surface. To conceal the malicious intent, TRACE decomposes a malicious task into multiple subtask sequences under different schemes and selects the sequence with the fewest explicitly harmful subtasks. TRACE then disguises the remaining harmful subtasks as benign-looking instructions by embedding them in task-aware scenarios with related roles, environments, directives, and heuristics. The scenarios are iteratively evolved through well-defined transformation actions, sampled by a Q-learning-inspired mechanism, to induce the agent to execute the harmful subtasks. Extensive evaluations on AgentHarm and AdvCUA show that TRACE consistently outperforms existing jailbreak baselines across multiple advanced LLM agents, achieving up to a 100% bypass rate and an average success score of 0.73. We also demonstrate the effectiveness of TRACE in controlled cyberattack instances. Our code and demos are available at https://github.com/ZJU-LLM-Safety/TRACE.git.