🤖 AI Summary
Existing large language model (LLM) fine-tuning methods often degrade safety performance—even when trained exclusively on harmless data—due to unintended amplification of unsafe behaviors during gradient updates.
Method: This paper introduces the Safety-Aware Probing (SAP) framework, the first approach to embed a differentiable safety probe directly into the gradient propagation path, enabling dynamic identification and suppression of hazardous gradient directions. SAP integrates a lightweight, plug-and-play architecture with differentiable safety-constrained optimization, ensuring full compatibility with standard fine-tuning pipelines.
Contribution: SAP achieves substantial reductions in harmful output rates across multiple safety benchmarks (an average decrease of 32.7%), consistently outperforming standard fine-tuned baselines on comprehensive safety metrics. Crucially, it preserves task utility: test loss remains comparable to standard fine-tuning, indicating that safety can be enhanced without a meaningful cost to task performance.
📝 Abstract
The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has raised substantial safety concerns. Despite safety alignment performed during pre-training, recent research shows that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental question of why fine-tuning on non-harmful data still degrades safety. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, identifying potential pitfalls in gradient directions and thereby reducing the risk of safety degradation while enhancing task-specific performance. Our extensive experimental results demonstrate that SAP reduces harmfulness below that of standard fine-tuned models while achieving test loss comparable to standard fine-tuning. Our code is available at https://github.com/ChengcanWu/SAP.
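The abstract does not spell out how "pitfalls in gradient directions" are suppressed, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: suppose a safety probe has flagged a direction in parameter space along which updates increase harmfulness (`unsafe_dir` is a hypothetical name, not from the paper), and the fine-tuning step removes the task gradient's component along that direction before updating.

```python
import numpy as np

def safety_aware_update(task_grad, unsafe_dir, lr=0.1):
    """Hypothetical sketch: drop the component of the task gradient that
    points along an 'unsafe' direction flagged by a safety probe, then
    take a gradient-descent step on the remaining (safe) component."""
    u = unsafe_dir / np.linalg.norm(unsafe_dir)  # unit unsafe direction
    harmful = float(np.dot(task_grad, u))        # alignment with unsafety
    if harmful > 0:                              # suppress only the harmful part
        task_grad = task_grad - harmful * u      # project it out
    return -lr * task_grad                       # standard descent step

# Usage: a gradient half-aligned with the unsafe direction keeps only
# its orthogonal (safe) component in the resulting update.
g = np.array([1.0, 1.0])
d = np.array([1.0, 0.0])
step = safety_aware_update(g, d)
```

The actual SAP framework embeds a differentiable probe into backpropagation rather than applying a post-hoc projection; this sketch only illustrates why filtering hazardous gradient directions can leave the task-relevant component of the update largely intact.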