🤖 AI Summary
Harmful fine-tuning attacks severely compromise the safety alignment of large language models (LLMs), yet existing defenses rely heavily on attack-specific hyperparameters (e.g., learning rate, epochs), limiting robustness.
Method: We propose a fully hyperparameter-agnostic post-hoc safety repair mechanism. It first evaluates parameter importance, then applies a single round of sparse structured pruning to precisely identify and remove weights responsible for harmful outputs. Crucially, it integrates harmful-behavior attribution analysis with joint safety–utility optimization—requiring neither access to the original fine-tuning data nor knowledge of the training process.
Contribution/Results: Extensive experiments across diverse attack settings and mainstream LLMs demonstrate that our method reduces harmful content generation scores by an average of 72%, while incurring negligible downstream task accuracy degradation (<0.5%). It exhibits strong generalization and practicality without hyperparameter tuning.
📝 Abstract
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment. Existing mitigation strategies include alignment-stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning-stage solutions \cite{huang2024lazy, mukhoti2023fine}. However, our evaluation shows that both categories of defenses fail *when some specific training hyper-parameters are chosen* -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, yet such settings may be necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning-stage solution that remains ***agnostic to the training hyper-parameters in the fine-tuning stage***. Antidote relies on the philosophy that by removing the harmful parameters, the model can be recovered from its harmful behaviors, regardless of how those harmful parameters were formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our project page is at https://huangtiansheng.github.io/Antidote_gh_page/
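To make the one-shot pruning idea concrete, here is a minimal sketch in NumPy. It assumes a simple importance score of the form |weight| × |gradient on harmful data| and a `sparsity` fraction controlling how many weights are removed; both are illustrative assumptions for exposition, not the paper's exact criterion or hyperparameter-free procedure.

```python
import numpy as np

def harmful_importance(weights, harmful_grads):
    # Illustrative score: weights with large magnitude AND large
    # gradient on harmful data are flagged as most responsible
    # for harmful behavior (assumption, not the paper's exact score).
    return np.abs(weights) * np.abs(harmful_grads)

def one_shot_prune(weights, harmful_grads, sparsity=0.1):
    # Zero out the top `sparsity` fraction of weights ranked by
    # harmful importance, in a single pass (no retraining).
    score = harmful_importance(weights, harmful_grads)
    k = int(sparsity * score.size)
    if k == 0:
        return weights.copy()
    # Flat indices of the k highest-scoring (most harmful) weights.
    idx = np.argpartition(score.ravel(), -k)[-k:]
    pruned = weights.copy().ravel()
    pruned[idx] = 0.0
    return pruned.reshape(weights.shape)
```

Because the pruning step only inspects the fine-tuned weights and harmful-data gradients, it is independent of whatever learning rate or epoch count produced them, which is the property the abstract emphasizes.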