🤖 AI Summary
Existing multi-turn jailbreaking attacks designed for large language models often fail when transferred to vision-language models (LVLMs), as they are readily intercepted by safety alignment mechanisms and struggle to elicit harmful outputs. To address this limitation, this work proposes MAPA, a novel dual-level adaptive attack framework that alternates between textual and visual adversarial actions within each turn while dynamically optimizing the attack trajectory across multiple turns to progressively amplify maliciousness in model responses. By integrating multi-turn adaptive prompting with a joint text-vision strategy, MAPA substantially enhances attack efficacy. Experiments on mainstream LVLMs—including LLaVA-V1.6-Mistral-7B and Qwen2.5-VL-7B-Instruct—demonstrate attack success rates 11%–35% higher than those of current state-of-the-art methods.
📝 Abstract
Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) because they introduce malicious content gradually across turns. When extending them to large vision-language models (LVLMs), we find that naively adding visual inputs makes existing multi-turn jailbreaks easy to defend against: an overly malicious visual input readily triggers the safety-alignment mechanisms of LVLMs, pushing responses toward conservatism. To address this, we propose MAPA, a multi-turn adaptive prompting attack that 1) within each turn, alternates between textual and visual attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11–35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct, and GPT-4o-mini.
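The two-level structure described above can be sketched as a simple control loop. Note this is an illustrative assumption of how the inner (per-turn action selection) and outer (cross-turn trajectory refinement) levels might interact, not the paper's actual implementation: `query_lvlm`, the mock scoring function, and the backtracking rule are all hypothetical placeholders.

```python
# Hedged sketch of a two-level adaptive attack loop in the spirit of MAPA.
# All names and scoring here are illustrative placeholders, not the paper's code.
import random

random.seed(0)  # deterministic for illustration

def query_lvlm(text, image):
    # Placeholder for querying the target LVLM. Returns a mock response and a
    # mock "maliciousness" score in [0, 1]; a real attack would call the model
    # and score its output with a judge.
    score = min(1.0, 0.01 * len(text) + (0.2 if image else 0.0) + 0.3 * random.random())
    return f"response to {text!r}", score

def mapa_style_attack(goal, max_turns=4):
    """Inner level: pick the better of a text vs. vision action each turn.
    Outer level: refine or backtrack the trajectory based on the score."""
    history, best_score = [], 0.0
    prompt = goal
    for turn in range(max_turns):
        # Inner level: try a textual action and a visual action, keep the one
        # that elicits the more malicious (higher-scoring) response.
        candidates = [
            query_lvlm(prompt + " [text action]", None),
            query_lvlm(prompt, "adversarial_image"),
        ]
        response, score = max(candidates, key=lambda c: c[1])
        # Outer level: if the response regressed (score dropped), backtrack to
        # a milder prompt; otherwise escalate the trajectory toward the goal.
        if score < best_score:
            prompt = goal  # back off: restart from the base prompt
        else:
            best_score = score
            prompt += " [escalate]"  # refine: push one step further
        history.append((turn, response, round(score, 3)))
    return history, best_score
```

Under these assumptions, the outer loop's backtracking is what keeps the attack below the model's refusal threshold while the inner loop extracts the most malicious response each turn.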