Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

📅 2026-06-10
🤖 AI Summary
Existing evaluations of backdoor attacks on vision-language agents focus on clean accuracy and attack success rate, overlooking unintended activation of the backdoor by non-target inputs (referred to as “trigger leakage”), which leads to an underestimation of security risks. This work formally defines the issue and introduces the Neighbor Leakage Rate (NLR) to quantify how often inputs that are semantically or visually close to the trigger elicit the malicious behavior, revealing that standard fine-tuning learns an overly broad activation region. To address this, the authors add hard-negative samples at edit distance one from the trigger during training, which narrows the activation region and improves trigger specificity. Experiments show that, at a 3% poisoning ratio, icon- and text-based triggers remain robust to common visual transformations yet exhibit high NLRs of 0.996 and 0.944, respectively; hard-negative training substantially reduces leakage, including in image-editing and embodied-manipulation workflows.
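The summary does not give a formula for NLR. A minimal sketch, assuming NLR is simply the fraction of neighbor inputs (the exact trigger excluded) that activate the attacker-specified behavior, might look like the following. The function name `neighbor_leakage_rate` and the `activates` callback are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Iterable


def neighbor_leakage_rate(
    neighbors: Iterable[str],
    activates: Callable[[str], bool],
) -> float:
    """Fraction of neighbor inputs that activate the backdoor.

    Assumption: NLR = (# neighbors that activate) / (# neighbors),
    with the exact trigger excluded from `neighbors`. The paper may
    define or normalize the metric differently.
    """
    neighbors = list(neighbors)
    if not neighbors:
        raise ValueError("need at least one neighbor input")
    leaked = sum(1 for n in neighbors if activates(n))
    return leaked / len(neighbors)


# Toy stand-ins for a backdoored model (hypothetical):
# a "broad" activation region fires on any string starting with "ab",
# an "exact" one fires only on the exact trigger "abc".
def broad(s: str) -> bool:
    return s.startswith("ab")


def exact(s: str) -> bool:
    return s == "abc"


toy_neighbors = ["abd", "abx", "xbc", "abcd"]
broad_nlr = neighbor_leakage_rate(toy_neighbors, broad)
exact_nlr = neighbor_leakage_rate(toy_neighbors, exact)
```

In this toy setup the broad detector leaks on three of the four neighbors while the exact detector leaks on none, which is the gap that an ASR-only evaluation would not reveal.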
📝 Abstract
Vision-Language Agentic Systems (VLAS) connect visual perception to planning, tool use, and physical actions. As a result, backdoor triggers can propagate through both the decision pipeline and its connected interfaces, making visual backdoors a system-level threat. Current evaluations of such backdoors focus on clean accuracy and attack success rate (ASR), metrics that capture whether a trigger works but not whether an attack is actually "precise", i.e., whether it triggers hidden behaviors only when intended. In this work, we formalize the failure of trigger precision as "trigger leakage": inputs that are visually or semantically close to the intended trigger inadvertently activate the attacker-specified behavior. To quantify this leakage, we introduce Neighbor Leakage Rate (NLR). Our experiments show that at a 3% poisoning ratio, icon and text triggers remain robust to common visual transformations, but their neighboring variants leak heavily, with NLR reaching 0.996 (icon) and 0.944 (text). Using textual triggers as a controlled probe, we show that standard fine-tuning learns a broad activation region rather than an exact trigger condition, causing neighboring strings to invoke the malicious behavior even when the exact trigger is absent. Adding edit-distance-one hard-negative samples during training substantially narrows this activation region and reduces leakage, including in image-editing and embodied-manipulation workflows, where leaked triggers can propagate into executable programs and action sequences.
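To make "edit-distance-one hard-negative samples" concrete for a textual trigger, the sketch below enumerates every string reachable from the trigger by a single insertion, deletion, or substitution. The function name `edit_distance_one`, the choice of alphabet, and the idea of pairing each variant with the clean (non-malicious) target are illustrative assumptions; the abstract does not specify how the authors sample or label their negatives.

```python
import string


def edit_distance_one(trigger: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """All strings at Levenshtein distance exactly 1 from `trigger`.

    Assumption: neighbors are drawn over a fixed character alphabet.
    """
    variants: set[str] = set()
    # Insertions: place each alphabet character at every position.
    for i in range(len(trigger) + 1):
        for c in alphabet:
            variants.add(trigger[:i] + c + trigger[i:])
    for i in range(len(trigger)):
        # Deletion: drop the character at position i.
        variants.add(trigger[:i] + trigger[i + 1:])
        # Substitutions: replace the character at position i.
        for c in alphabet:
            variants.add(trigger[:i] + c + trigger[i + 1:])
    # Substituting a character with itself reproduces the trigger; remove it.
    variants.discard(trigger)
    return variants


# Tiny example over a two-letter alphabet so the result is easy to inspect.
toy_negatives = edit_distance_one("ab", alphabet="ab")

# Hypothetical use: pair each variant with the clean behavior label so that
# fine-tuning sees near-trigger strings that must NOT activate the backdoor.
hard_negative_samples = [(v, "clean") for v in sorted(toy_negatives)]
```

The same set of variants can also serve as the neighbor pool when measuring leakage, so that training negatives and evaluation neighbors are drawn from the same notion of closeness.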
Problem

Research questions and friction points this paper is trying to address.

trigger leakage
backdoor attacks
Vision-Language Agentic Systems
attack precision
Neighbor Leakage Rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

trigger leakage
Neighbor Leakage Rate
vision-language agentic systems
backdoor precision
hard-negative training