🤖 AI Summary
Existing large language model (LLM) fine-tuning methods often degrade safety performance—even when trained exclusively on harmless data—due to unintended amplification of unsafe behaviors during gradient updates.
Method: This paper introduces the Safety-Aware Probing (SAP) framework, the first approach to embed a differentiable safety probe directly into the gradient propagation path, enabling dynamic identification and suppression of hazardous gradient directions. SAP integrates a lightweight, plug-and-play architecture with differentiable safety-constrained optimization, ensuring full compatibility with standard fine-tuning pipelines.
Contribution: SAP achieves substantial reductions in harmful output rates across multiple safety benchmarks (an average decrease of 32.7%), consistently outperforming standard fine-tuned baselines on comprehensive safety metrics. Crucially, it preserves task utility: test loss remains comparable to standard fine-tuning, indicating that safety can be enhanced without a meaningful cost to task performance.
📝 Abstract
The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has raised substantial safety concerns. Despite safety alignment performed during pre-training, recent research shows that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental question of why fine-tuning on non-harmful data still degrades safety. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, identifying potential pitfalls in gradient directions and thereby reducing the risk of safety degradation while enhancing task-specific performance. Our extensive experimental results demonstrate that SAP reduces harmfulness below that of standard fine-tuned models while achieving test loss comparable to standard fine-tuning. Our code is available at https://github.com/ChengcanWu/SAP.
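The abstract does not spell out how "pitfalls in gradient directions" are suppressed, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: suppose a safety probe has flagged a direction in parameter space along which updates increase harmfulness (`unsafe_dir` is a hypothetical name, not from the paper), and the fine-tuning step removes the task gradient's component along that direction before updating.

```python
import numpy as np

def safety_aware_update(task_grad, unsafe_dir, lr=0.1):
    """Hypothetical sketch: drop the component of the task gradient that
    points along an 'unsafe' direction flagged by a safety probe, then
    take a gradient-descent step on the remaining (safe) component."""
    u = unsafe_dir / np.linalg.norm(unsafe_dir)  # unit unsafe direction
    harmful = float(np.dot(task_grad, u))        # alignment with unsafety
    if harmful > 0:                              # suppress only the harmful part
        task_grad = task_grad - harmful * u      # project it out
    return -lr * task_grad                       # standard descent step

# Usage: a gradient half-aligned with the unsafe direction keeps only
# its orthogonal (safe) component in the resulting update.
g = np.array([1.0, 1.0])
d = np.array([1.0, 0.0])
step = safety_aware_update(g, d)
```

The actual SAP framework embeds a differentiable probe into backpropagation rather than applying a post-hoc projection; this sketch only illustrates why filtering hazardous gradient directions can leave the task-relevant component of the update largely intact.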