🤖 AI Summary
This work examines why jailbreaking attacks generated on open-source large language models (LLMs) often fail to transfer to closed-source LLMs, showing that such failures stem from the adversarial sequences overfitting the source model's parameters. To mitigate this, the authors propose Perceived-importance Flatten (PiF), a method that uniformly disperses the model's focus across neutral-intent tokens in the original input, obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Empirical evaluation indicates that PiF improves cross-model jailbreak transfer success rates on major closed-source models, including GPT-4, Claude, and Gemini, enabling effective and efficient red-teaming assessments of proprietary LLMs.
📝 Abstract
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. By incorporating adversarial sequences, these attacks can redirect the source LLM's focus away from malicious-intent tokens in the original input, thereby obstructing the model's intent recognition and eliciting harmful responses. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.
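The core mechanism, dispersing perceived importance away from malicious-intent tokens and onto neutral ones, can be illustrated with a minimal greedy sketch. This is not the paper's implementation: `score_fn` is a stand-in for a source LLM's intent/refusal score, leave-one-out deletion is one simple way to estimate a token's perceived importance, and `synonyms` is a hypothetical lookup table of neutral-token replacements.

```python
# Hypothetical sketch of the perceived-importance flattening idea:
# estimate each token's perceived importance via leave-one-out scoring,
# then replace low-importance neutral tokens with synonyms that raise
# their importance, spreading the model's focus more uniformly.
from typing import Callable, Dict, List


def perceived_importance(tokens: List[str],
                         score_fn: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out importance: how much the score drops when token i is removed."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def pif_flatten(tokens: List[str],
                score_fn: Callable[[List[str]], float],
                synonyms: Dict[str, List[str]],
                steps: int = 10) -> List[str]:
    """Greedy sketch: repeatedly pick the least-important replaceable token and
    swap in the synonym that most increases its perceived importance."""
    tokens = list(tokens)
    for _ in range(steps):
        imp = perceived_importance(tokens, score_fn)
        candidates = [i for i, t in enumerate(tokens) if t in synonyms]
        if not candidates:
            break
        i = min(candidates, key=lambda j: imp[j])
        best, best_imp = tokens[i], imp[i]
        for alt in synonyms[tokens[i]]:
            trial = tokens[:i] + [alt] + tokens[i + 1:]
            if perceived_importance(trial, score_fn)[i] > best_imp:
                best, best_imp = alt, perceived_importance(trial, score_fn)[i]
        if best == tokens[i]:  # no synonym improves importance; stop early
            break
        tokens[i] = best
    return tokens
```

In an actual red-teaming setting, the scoring function would query the source LLM (e.g., the probability mass it assigns to refusal tokens), and the edit operates only on neutral-intent tokens so the original malicious-intent tokens are never modified, matching the paper's stated goal of avoiding overfitted adversarial sequences.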