🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to safety degradation during fine-tuning, particularly when original safety alignment data is unavailable. To mitigate this issue, the authors propose GR-SAP, a novel framework that introduces generative replay into safety alignment preservation. GR-SAP leverages the LLM itself to generate domain-specific safety-aligned data, which is then used in conjunction with downstream task data to jointly optimize both task performance and safety objectives during fine-tuning. Experimental results across multiple models and tasks demonstrate that GR-SAP effectively alleviates safety degradation induced by fine-tuning while maintaining strong downstream task performance.
📝 Abstract
Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.
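The core idea, mixing model-generated safety-alignment data into each fine-tuning batch and optimizing a weighted joint objective, can be sketched as follows. This is a minimal illustration of the generative-replay recipe described above, not the paper's implementation; all names (`mix_batch`, `joint_loss`, `replay_ratio`, `lam`) are hypothetical.

```python
# Sketch of generative replay for safety preservation (illustrative only):
# a fraction of each fine-tuning batch is drawn from safety-alignment
# examples the LLM itself generated, and the loss is a weighted sum of
# the downstream task loss and the safety loss.
import random


def mix_batch(task_examples, safety_examples, replay_ratio=0.2, seed=0):
    """Build a batch where roughly `replay_ratio` of the task-batch size
    is added from generated safety-alignment data (the replay step)."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(task_examples) * replay_ratio))
    replay = [rng.choice(safety_examples) for _ in range(n_safety)]
    return task_examples + replay


def joint_loss(task_loss, safety_loss, lam=0.5):
    """Joint objective: L = L_task + lam * L_safety."""
    return task_loss + lam * safety_loss
```

In a real training loop these functions would operate on tokenized examples and per-batch losses from the model; the weighting `lam` and the replay ratio trade off downstream performance against safety preservation.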