🤖 AI Summary
This work addresses the vulnerability of open-weight large language models (LLMs) to malicious fine-tuning, in which a few steps of supervised fine-tuning (SFT) on a poisoned dataset can break safety alignment. Existing alignment-stage defenses mainly target parameter-efficient fine-tuning attacks and fail against stronger full-parameter attacks. To counter this threat, the authors propose Patcher, a defense inspired by adversarial training and formulated as a bilevel optimization problem. Patcher simulates the attack during alignment and strengthens it by scaling up the number of optimization steps in the adversarial loop, which forces the defender toward model parameters that are insensitive to stronger attacks. An efficient parallel algorithm reduces the wall-clock training time while preserving Patcher's performance. Experiments show that Patcher substantially improves robustness over vanilla SFT alignment and that this robustness transfers to diverse attack scenarios and model sizes.
📝 Abstract
Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which can compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed for attacks that use parameter-efficient finetuning methods, and they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the number of optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, which decreases the wall-clock training time while preserving Patcher's performance. Extensive experiments show that Patcher substantially improves the model's robustness compared to vanilla SFT alignment, and that this robustness transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.
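The bi-level structure described in the abstract can be illustrated with a generic adversarial-training formulation. This is only a sketch under stated assumptions, not the paper's exact objective: the symbols $\theta$ (defender parameters), $\theta'_k$ (simulated attacker parameters after $k$ steps), $\mathcal{L}_{\text{align}}$, $\mathcal{L}_{\text{safe}}$, $\mathcal{L}_{\text{attack}}$, the weight $\lambda$, the attacker step size $\eta$, and the plain gradient-descent attacker are all hypothetical choices made for illustration.

```latex
% Generic bi-level sketch (all symbols are illustrative assumptions):
% outer level = defender, inner level = simulated K-step finetuning attack.
\begin{aligned}
\min_{\theta}\quad
  & \mathcal{L}_{\text{align}}(\theta)
    + \lambda\, \mathcal{L}_{\text{safe}}\bigl(\theta'_K(\theta)\bigr) \\
\text{s.t.}\quad
  & \theta'_0 = \theta, \qquad
    \theta'_{k+1} = \theta'_k - \eta\, \nabla_{\theta'} \mathcal{L}_{\text{attack}}(\theta'_k),
    \quad k = 0, \dots, K-1 .
\end{aligned}
```

In this reading, the inner recursion plays the role of the adversarial loop, and increasing $K$ corresponds to scaling up its optimization steps: the outer problem then has to find a $\theta$ whose safety loss stays low even after a longer simulated attack.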