🤖 AI Summary
This work exposes a critical security vulnerability in text-to-video (T2V) models at the implicit semantic level: seemingly neutral prompts containing rich cross-modal visual association cues can bypass content safety filters and generate policy-violating, semantically unsafe videos. To address this, we propose the first modular prompt attack framework, integrating neutral scene anchors, latent auditory triggers, and stylistic modulators to explicitly encode audio-visual co-occurrence priors and steer cross-modal associative generation. We further design constrained optimization and guided search strategies to efficiently discover highly stealthy adversarial prompts within the modular prompt space. Evaluated on seven mainstream T2V models, including multiple commercial systems, our approach achieves an average 23% improvement in attack success rate. This is the first systematic demonstration of T2V models' susceptibility to implicit semantic attacks, revealing fundamental weaknesses in current safety mechanisms.
📝 Abstract
Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend against. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments across 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate on commercial models.
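The three-module prompt composition and constrained guided search described above can be sketched in miniature. Everything here is hypothetical: the component pools, the `assemble_prompt` template, and the `score_fn`/`stealth_fn` scorers are illustrative stand-ins (the paper does not specify these), and a real attack would score candidates by querying a T2V model and a stealth classifier.

```python
import itertools

# Hypothetical component pools, one per module of the prompt design.
SCENE_ANCHORS = ["an old wooden cabin at dusk", "a quiet hospital corridor"]
AUDIO_TRIGGERS = ["floorboards creaking slowly", "muffled thuds behind a door"]
STYLE_MODULATORS = ["handheld camera, dim lighting", "slow zoom, tense atmosphere"]

def assemble_prompt(anchor: str, trigger: str, modulator: str) -> str:
    """Compose the three modules into one benign-looking T2V prompt."""
    return f"{anchor}, {trigger}, {modulator}"

def guided_search(score_fn, stealth_fn, stealth_min: float = 0.5):
    """Search the modular prompt space for the highest-scoring prompt
    subject to a stealth constraint (the constrained optimization).
    score_fn estimates attack effectiveness; stealth_fn estimates how
    benign the prompt appears to a safety filter."""
    best, best_score = None, float("-inf")
    for a, t, m in itertools.product(SCENE_ANCHORS, AUDIO_TRIGGERS, STYLE_MODULATORS):
        prompt = assemble_prompt(a, t, m)
        if stealth_fn(prompt) < stealth_min:
            continue  # constraint: discard prompts a filter would likely flag
        score = score_fn(prompt)
        if score > best_score:
            best, best_score = prompt, score
    return best, best_score
```

With realistic pools the space is too large to enumerate, which is why a guided (rather than exhaustive) search is needed; this sketch only shows the objective-plus-constraint structure on a toy 2x2x2 space.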