🤖 AI Summary
Harmful fine-tuning attacks severely compromise the safety alignment of large language models (LLMs), yet existing defenses rely heavily on attack-specific hyperparameters (e.g., learning rate, epochs), limiting robustness.
Method: We propose a fully hyperparameter-agnostic post-hoc safety repair mechanism. It first evaluates parameter importance, then applies a single round of sparse structured pruning to precisely identify and remove weights responsible for harmful outputs. Crucially, it integrates harmful-behavior attribution analysis with joint safety–utility optimization—requiring neither access to the original fine-tuning data nor knowledge of the training process.
Contribution/Results: Extensive experiments across diverse attack settings and mainstream LLMs demonstrate that our method reduces harmful content generation scores by an average of 72%, while incurring negligible downstream task accuracy degradation (<0.5%). It exhibits strong generalization and practicality without hyperparameter tuning.
📝 Abstract
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment. Existing mitigation strategies include alignment-stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning-stage solutions \cite{huang2024lazy, mukhoti2023fine}. However, our evaluation shows that both categories of defenses fail *when some specific training hyper-parameters are chosen* -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, yet such settings may be necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning-stage solution that remains ***agnostic to the training hyper-parameters in the fine-tuning stage***. Antidote relies on the philosophy that by removing the harmful parameters, the model can be recovered from its harmful behaviors, regardless of how those harmful parameters were formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our project page is at https://huangtiansheng.github.io/Antidote_gh_page/
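To make the one-shot pruning idea concrete, here is a minimal sketch in NumPy. It assumes a simple importance score of the form |weight| × |gradient on harmful data| and a `sparsity` fraction controlling how many weights are removed; both are illustrative assumptions for exposition, not the paper's exact criterion or hyperparameter-free procedure.

```python
import numpy as np

def harmful_importance(weights, harmful_grads):
    # Illustrative score: weights with large magnitude AND large
    # gradient on harmful data are flagged as most responsible
    # for harmful behavior (assumption, not the paper's exact score).
    return np.abs(weights) * np.abs(harmful_grads)

def one_shot_prune(weights, harmful_grads, sparsity=0.1):
    # Zero out the top `sparsity` fraction of weights ranked by
    # harmful importance, in a single pass (no retraining).
    score = harmful_importance(weights, harmful_grads)
    k = int(sparsity * score.size)
    if k == 0:
        return weights.copy()
    # Flat indices of the k highest-scoring (most harmful) weights.
    idx = np.argpartition(score.ravel(), -k)[-k:]
    pruned = weights.copy().ravel()
    pruned[idx] = 0.0
    return pruned.reshape(weights.shape)
```

Because the pruning step only inspects the fine-tuned weights and harmful-data gradients, it is independent of whatever learning rate or epoch count produced them, which is the property the abstract emphasizes.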