CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing selective unlearning defenses fail against harmful fine-tuning attacks that bypass the safety alignment of commercial large language models (LLMs), because the models' strong general adaptability lets attackers rapidly relearn or repurpose removed capabilities. Method: We propose an "active collapse" defense paradigm: lightweight, conditionally triggered "collapse traps" are embedded into the model so that, upon a persistent misalignment-inducing fine-tuning attempt, the model automatically and irreversibly degrades its own language modeling capability, fundamentally preventing execution of harmful tasks. The mechanism combines monitoring of fine-tuning dynamics with pre-designed parameter collapse pathways. Contribution/Results: Our approach preserves benign fine-tuning functionality with negligible performance degradation (<0.5%) while reducing malicious task success rates to near zero. To the best of our knowledge, this is the first work to introduce controllable, irreversible model collapse as a security mechanism for fine-tuning services, significantly enhancing the robustness of LLM-based offerings.

📝 Abstract
Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse, effectively forcing the model to "unlearn everything," specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that attackers exploit, tackling the core issue unaddressed by selective unlearning. We introduce the Collapse Trap (CTRAP) as a practical mechanism to implement this concept conditionally. Embedded during alignment, CTRAP pre-configures the model's reaction to subsequent fine-tuning dynamics. If updates during fine-tuning constitute a persistent attempt to reverse safety alignment, the pre-configured trap triggers a progressive degradation of the model's core language modeling abilities, ultimately rendering it inert and useless for the attacker. Crucially, this collapse mechanism remains dormant during benign fine-tuning, ensuring the model's utility and general capabilities are preserved for legitimate users. Extensive empirical results demonstrate that CTRAP effectively counters harmful fine-tuning risks across various LLMs and attack settings, while maintaining high performance in benign scenarios. Our code is available at https://anonymous.4open.science/r/CTRAP.
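For intuition, below is a minimal, hypothetical PyTorch sketch of how such a conditional collapse trap could be pre-configured during alignment via a meta-learning-style objective. This is one reading of the abstract, not the authors' implementation: the function names, the single-step simulated attack, the choice of benign LM loss as the collapse signal, and all hyperparameters are illustrative assumptions, and the sketch presumes a HuggingFace-style causal LM that returns `.logits`.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0

def lm_loss(model, params, input_ids):
    """Next-token cross-entropy of `model` evaluated under parameter set `params`."""
    logits = functional_call(model, params, (input_ids,)).logits
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def ctrap_step(model, benign_ids, harmful_ids, inner_lr=1e-3, trap_weight=0.5):
    """One alignment step that also plants the trap (illustrative formulation)."""
    params = dict(model.named_parameters())

    # (1) Ordinary alignment objective on benign / safety data.
    align_loss = lm_loss(model, params, benign_ids)

    # (2) Simulate one attacker fine-tuning step on harmful data; keep the
    #     graph (create_graph=True) so the outer update can differentiate
    #     through the simulated attack.
    harm_loss = lm_loss(model, params, harmful_ids)
    grads = torch.autograd.grad(harm_loss, list(params.values()), create_graph=True)
    attacked = {name: p - inner_lr * g
                for (name, p), g in zip(params.items(), grads)}

    # (3) Collapse term: after the simulated harmful update, the model should
    #     LOSE general language-modeling ability, so its benign LM loss at the
    #     attacked parameters is maximized (hence the minus sign). The paper's
    #     actual collapse pathway may differ; this is one plausible proxy.
    collapse_loss = -lm_loss(model, attacked, benign_ids)

    return align_loss + trap_weight * collapse_loss
```

The intended effect of term (3) is to shape the loss landscape so that parameter updates serving a harmful objective also travel along a pre-designed degradation pathway for general language modeling, while benign updates incur no such penalty and leave the trap dormant.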
Problem

Research questions and friction points this paper is trying to address.

Prevent harmful fine-tuning attacks on LLMs
Address limitations of selective unlearning defenses
Safeguard model capabilities during benign fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Induces model collapse to counter harmful fine-tuning
Embeds a conditional collapse trap during alignment (see the outer-loop sketch after this list)
Preserves utility during benign fine-tuning scenarios
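For completeness, here is a hypothetical outer alignment loop that applies the `ctrap_step` sketch above; `model`, the data loaders, and the optimizer settings are placeholders, not details from the paper.

```python
import torch

# Placeholder setup: `model` is a causal LM; the loaders yield batches of
# token-id tensors drawn from benign/safety data and harmful fine-tuning data.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for benign_ids, harmful_ids in zip(benign_loader, harmful_loader):
    optimizer.zero_grad()
    loss = ctrap_step(model, benign_ids, harmful_ids)
    loss.backward()  # second-order gradients flow through the simulated attack
    optimizer.step()
```

Because the trap is baked into the released weights by this outer loop, the provider needs no hooks in the attacker's fine-tuning run: benign fine-tuning proceeds normally, while persistently misaligning updates push the model down the pre-configured collapse pathway.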
👥 Authors
Biao Yi, Nankai University (LLM Security · Trustworthy LLM · Steganography)
Tiansheng Huang, Georgia Institute of Technology (Parallel and Distributed Computing · Distributed Machine Learning · LLM Safety)
Baolei Zhang, Nankai University
Tong Li, College of Cyber Science, Nankai University
Lihai Nie, College of Cyber Science, Nankai University
Zheli Liu, College of Cyber Science, Nankai University
Li Shen, Shenzhen Campus of Sun Yat-sen University