π€ AI Summary
This work addresses a critical gap in existing research on jailbreaking attacks against large language model (LLM) agents with tool use, which typically assumes continuously visible dialogue context and overlooks the loss of provenance in real-world deployments due to modular architectures and temporal fragmentation. We formally define this vulnerability as the βprovenance gapβ failure mode and introduce Context Fragmentation Decomposition (CFD), a novel attack paradigm that generates benign intermediate artifacts in early interactions and later combines innocuous tool operations to elicit harmful behavior, thereby evading current defenses. To counter such threats, we propose cross-step tool-call trajectory analysis, intermediate artifact manipulation detection, and fine-grained provenance tracking, along with a verifiable lineage-marking mitigation strategy. Experiments demonstrate that CFD achieves up to a 28.3 percentage point improvement over state-of-the-art methods on multiple jailbreaking benchmarks and remains highly effective even against strong single-turn content moderators.
π Abstract
Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including ``multi-turn'' jailbreaks such as Crescendo and Tree of Attacks,still assume a single contiguous conversation visible to the defender. This assumption breaks down in real agent pipelines, where enforcement is fragmented across tools, modules, and time, and where artifact provenance is often not tracked. We operationalize a deployment failure mode for tool-using LLM agents, the \emph{provenance gap}, and study reproducible triggers for it: \emph{Context-Fractured Decomposition} (CFD), a family of cross-context multi-step jailbreaks that preserve benign-looking intermediate artifacts from an early interaction and elicit harmful behavior much later, potentially in a different agent instance or workflow stage, via individually innocuous tool actions whose risk emerges only under delayed artifact-mediated composition. We instrument the failure mode with trace-level diagnostics and outline a verifiable mitigation direction (provenance lineage tagging). Across agent-system jailbreak benchmarks, CFD improves success rates by up to 28.3 percentage points over state-of-the-art baselines, even against strong single-turn judges. Disclaimer: This paper contains examples of harmful or offensive language.