TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing safety alignment mechanisms for large language models (LLMs) are vulnerable to jailbreaking, while current jailbreak methods suffer from heavy reliance on manual effort, high computational cost, or degradation of model utility. Method: This paper proposes TwinBreak, a lightweight safety-mechanism disentanglement method based on twin prompts, i.e., semantically and structurally similar prompt pairs. It performs contrastive analysis on intermediate-layer representations to precisely identify and prune the parameters that implement the embedded safety "backdoor." The approach integrates hierarchical parameter-importance estimation, intermediate-representation divergence detection, and lightweight fine-tuning. Contribution/Results: The authors introduce TwinPrompt, the first twin-prompt dataset (100 pairs), enabling fine-grained, low-intrusion alignment decoupling. Evaluated on 16 mainstream LLMs from five vendors, TwinBreak achieves 89%–98% jailbreak success rates with negligible computational overhead and no statistically significant degradation on standard tasks.
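The intermediate-representation divergence step described above can be illustrated with a toy sketch. This is not the authors' code: the MLP, the input vectors standing in for a twin prompt pair, and the cosine-divergence score are all illustrative assumptions, showing only the generic idea of comparing per-layer activations for two structurally similar inputs to find where they separate.

```python
import numpy as np

def layer_activations(x, weights):
    """Run x through a toy MLP, collecting each layer's activation."""
    acts, h = [], x
    for W in weights:
        h = np.tanh(W @ h)
        acts.append(h)
    return acts

def cosine_divergence(a, b):
    """1 - cosine similarity; larger means the representations differ more."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 16)) * 0.5 for _ in range(3)]

# Stand-ins for a twin prompt pair: two nearly identical input embeddings
# that share structure but differ slightly in content.
x_a = rng.standard_normal(16)
x_b = x_a + 0.05 * rng.standard_normal(16)

divs = [cosine_divergence(a, b)
        for a, b in zip(layer_activations(x_a, weights),
                        layer_activations(x_b, weights))]
peak_layer = int(np.argmax(divs))  # layer where the twins separate most
print(divs, peak_layer)
```

In a real LLM the activations would come from the model's hidden states rather than a toy MLP, and the per-layer divergence profile would guide which layers to analyze further.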

📝 Abstract
Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, these mechanisms can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an embedded backdoor, TwinBreak identifies and prunes parameters responsible for this functionality. By focusing on the most relevant model layers, TwinBreak performs fine-grained analysis of parameters essential to model utility and safety. TwinBreak is the first method to analyze intermediate outputs from prompts with high structural and content similarity to isolate safety parameters. We present the TwinPrompt dataset containing 100 such twin prompts. Experiments confirm TwinBreak's effectiveness, achieving 89% to 98% success rates with minimal computational requirements across 16 LLMs from five vendors.
Problem

Research questions and friction points this paper is trying to address.

LLM safety alignment can be bypassed, yet the mechanism behind it is poorly localized
Existing jailbreaks demand heavy manual effort or high computational cost
Prior alignment-removal methods modify models so broadly that regular utility degrades
Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats safety alignment as an embedded backdoor and prunes its parameters via fine-grained, layer-focused analysis
First to contrast intermediate outputs of twin prompts (TwinPrompt dataset, 100 pairs) to isolate safety parameters
Achieves 89%–98% success rates across 16 LLMs with minimal compute and preserved utility