SafeGene: Reusable Adapters for Transferable Safety Alignment

📅 2026-06-02
🤖 AI Summary
This work addresses the vulnerability of open-weight large language models (LLMs) to malicious prompts that arises when downstream fine-tuning degrades safety alignment, even when the fine-tuning data is not intentionally harmful. To mitigate this, the authors propose SafeGene, a reusable safety adapter module that models safety alignment as a standalone representation, decoupled from task-specific updates and reusable across tasks within an architecture-compatible model family. SafeGene extracts safety vectors by contrasting aligned and degraded models, refines them with a data-aware layer selection strategy, and applies them to each task-adapted model through few-shot, layer-wise coefficient recalibration. Evaluated across multiple model families, downstream tasks, and safety judges, SafeGene reduces harmful response rates while preserving task performance, outperforming representative safe adaptation methods in the safety-utility trade-off.
📝 Abstract
Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned–degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety–utility trade-off.
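The abstract gives no formulas, so the following is only a toy sketch of the three stages it names, assuming simple task-vector-style arithmetic on per-layer weights. The function names, the `scores` used for layer selection, and the `coeffs` values are hypothetical placeholders; in the paper the selection is data-aware and the coefficients come from few-shot calibration, neither of which is reproduced here.

```python
from typing import Dict, List

# Toy stand-in for model weights: layer name -> flattened parameter list.
Weights = Dict[str, List[float]]


def safety_vectors(aligned: Weights, degraded: Weights) -> Weights:
    """Per-layer discrepancy (aligned minus degraded). Assumed form, not the paper's."""
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in aligned
    }


def select_layers(vectors: Weights, scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k layers with the highest score. Scores are hypothetical inputs here."""
    return sorted(vectors, key=lambda name: scores[name], reverse=True)[:k]


def apply_safety(task_model: Weights, vectors: Weights,
                 coeffs: Dict[str, float]) -> Weights:
    """Add coeff * vector to each layer that has a coefficient; leave others unchanged."""
    patched = {}
    for name, weights in task_model.items():
        c = coeffs.get(name, 0.0)  # layers without a coefficient get no update
        patched[name] = [w + c * v for w, v in zip(weights, vectors[name])]
    return patched


# Toy data: two layers, two parameters each.
aligned = {"layer0": [1.0, 2.0], "layer1": [0.5, 0.5]}
degraded = {"layer0": [0.0, 2.0], "layer1": [0.5, 0.0]}
task_model = {"layer0": [0.0, 1.0], "layer1": [1.0, 1.0]}

vectors = safety_vectors(aligned, degraded)
selected = select_layers(vectors, scores={"layer0": 0.9, "layer1": 0.1}, k=1)
coeffs = {name: 0.5 for name in selected}  # placeholder for calibrated coefficients
patched = apply_safety(task_model, vectors, coeffs)
```

In this sketch the safety vectors are computed once and reused, while only `coeffs` changes per task-adapted model, which mirrors the reuse-then-recalibrate split the abstract describes.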
Problem

Research questions and friction points this paper is trying to address.

safety alignment
downstream fine-tuning
harmful prompts
safe adaptation
model safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

safety alignment
reusable adapter
transferable safety
model fine-tuning
layer-wise recalibration