🤖 AI Summary
Safety alignment mechanisms in large language models (LLMs) are vulnerable to sequential context manipulation attacks. Method: We propose SequentialBreak—a novel single-query jailbreaking technique that embeds malicious prompts within a chain of benign, multi-step instructions. It exploits LLMs’ attentional shifts across sequential prompts and their preference for semantic coherence to suppress awareness of harmful content and elicit policy-violating outputs. SequentialBreak requires no iterative optimization, complex encoding, or task-specific customization, and natively supports diverse narrative structures—including question-answering, dialogue completion, and interactive gaming—across both open- and closed-source LLMs. Contribution/Results: Experiments demonstrate that SequentialBreak significantly outperforms existing baselines under single-query constraints. It is the first work to systematically expose a fundamental vulnerability in current safety defenses: their inadequate modeling of dynamic sequential context. Our findings underscore the urgent need for robust, context-aware defense mechanisms capable of tracking and evaluating evolving prompt semantics across multi-turn interactions.
📝 Abstract
As the integration of Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defenses of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, and single-shot or iterative prompt optimization to conceal malicious prompts. Notably, sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, such as Question Bank, Dialog Completion, and Game Environment, in which a harmful prompt is embedded among benign ones to fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak uses only a single query to achieve a substantial gain in attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All result files and the website associated with this research are available at: https://anonymous.4open.science/r/JailBreakAttack-4F3B/.