SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning large language models (LLMs) can degrade their safety alignment, even on benign data. SafeMERGE is a post-fine-tuning framework that detects layers deviating from safe behavior via a per-layer cosine similarity criterion and selectively merges only those layers with their safety-aligned counterparts, interpolating between the fine-tuned and safety-aligned models in a subspace-guided, per-layer manner. Experiments on Llama-2-7B-Chat and Qwen-2-7B-Instruct show substantial reductions in harmful outputs while preserving, and sometimes improving, accuracy on GSM8K and PubMedQA, consistently outperforming other post-fine-tuning safety defenses.

📝 Abstract
Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even for benign fine-tuning datasets. We address this challenge by proposing SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility. It achieves this by selectively merging fine-tuned and safety-aligned model layers only when those deviate from safe behavior, measured by a cosine similarity criterion. We evaluate SafeMERGE against other fine-tuning- and post-fine-tuning-stage approaches for Llama-2-7B-Chat and Qwen-2-7B-Instruct models on GSM8K and PubMedQA tasks while exploring different merging strategies. We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, sometimes even enhancing it. The results suggest that our selective, subspace-guided, and per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs while outperforming simpler post-fine-tuning-stage defenses.
Problem

Research questions and friction points this paper is trying to address.

Preserving safety alignment in fine-tuned LLMs
Selective merging of fine-tuned and safety-aligned layers
Reducing harmful outputs without sacrificing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective layer-wise model merging for safety
Cosine similarity measures layer deviation
Subspace-guided per-layer merging strategy
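The selection-then-merge idea above can be sketched in a few lines: compare each fine-tuned layer to its safety-aligned counterpart by cosine similarity, keep the task weights where they remain close, and interpolate toward the aligned model where they deviate. This is a minimal illustration, not the paper's implementation; the threshold `tau`, the interpolation weight `alpha`, and the use of plain linear interpolation (rather than the paper's subspace-guided merging) are assumptions for demonstration.

```python
import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two layers' flattened weight tensors."""
    return torch.nn.functional.cosine_similarity(
        a.flatten(), b.flatten(), dim=0
    ).item()

def safe_merge(finetuned: dict, aligned: dict,
               tau: float = 0.95, alpha: float = 0.5) -> dict:
    """Layer-selective merge (illustrative sketch, not the paper's exact method).

    Layers whose cosine similarity to the safety-aligned weights stays
    above `tau` are kept as-is; deviating layers are linearly interpolated
    toward the safety-aligned model.
    """
    merged = {}
    for name, w_ft in finetuned.items():
        w_al = aligned[name]
        if cosine_sim(w_ft, w_al) >= tau:
            merged[name] = w_ft.clone()                       # still close to safe behavior
        else:
            merged[name] = alpha * w_ft + (1 - alpha) * w_al  # deviating: pull toward aligned
    return merged
```

In practice the two `dict`s would be the models' `state_dict()`s; only layers flagged as deviating are modified, which is what keeps task utility largely intact.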
Aladin Djuhera
Technical University Munich, Chair of Theoretical Information Technology
S. Kadhe
IBM Research
Farhan Ahmed
IBM Research
Syed Zawad
Research Scientist, IBM
Machine Learning · Distributed Systems · Cloud Computing · Federated Learning
Holger Boche
Technische Universität München
Information Theory · Signal Processing · Communication Theory