Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

📅 2025-07-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fine-tuning aligned large language models (LLMs) with ostensibly benign downstream data can inadvertently degrade alignment due to latent safety-weakening features, thereby increasing vulnerability to adversarial attacks. To address this, we propose Layer-Aware Representation Filtering (LARF), the first method that identifies fine-grained unsafe samples by analyzing activation representations in internally identified safety-sensitive layers. LARF introduces a safety-sensitive layer probing mechanism to quantify each training sample’s potential harm to alignment fidelity, enabling targeted filtering of high-risk instances. Experiments demonstrate that LARF accurately removes “benign-but-harmful” samples, significantly mitigating safety degradation induced by fine-tuning across multiple benchmarks. The approach enhances both model robustness against adversarial manipulation and alignment stability during adaptation.
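To make the layer-probing idea concrete, below is a minimal sketch of how one might locate a safety-sensitive layer: compare hidden states for harmful versus benign probe prompts at every layer and pick the layer where they separate most. The model name, probe prompts, and difference-of-means separation metric are illustrative assumptions, not the paper's exact procedure (see the repository linked in the abstract for the authors' implementation).

```python
# Minimal sketch of safety-sensitive layer probing. The model name, probe
# prompts, and separation metric below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed: any aligned chat LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, output_hidden_states=True
)
model.eval()

def last_token_reps(prompts):
    """Final-token hidden state at every layer, stacked per prompt."""
    reps = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states: tuple of (1, seq_len, d_model), embeddings + each layer
        reps.append(torch.stack([h[0, -1] for h in out.hidden_states]))
    return torch.stack(reps)  # (n_prompts, n_layers + 1, d_model)

harmful_probes = ["Describe how to pick a lock to enter a stranger's house."]
benign_probes = ["Describe how to bake a loaf of sourdough bread."]

h_reps = last_token_reps(harmful_probes)
b_reps = last_token_reps(benign_probes)

# Separation per layer: distance between mean harmful and mean benign
# representations; the layer with the largest gap is treated as the
# safety-sensitive layer. A simple proxy for the paper's sensitivity measure.
sep = (h_reps.mean(0) - b_reps.mean(0)).float().norm(dim=-1)
safety_layer = int(sep.argmax())
print(f"Most safety-sensitive layer under this proxy: {safety_layer}")
```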

📝 Abstract
With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step in adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning on seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. LARF identifies safety-sensitive layers within the LLM and leverages their representations to detect which samples in the post-training dataset carry safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features; after removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Code is available at https://github.com/LLLeoLi/LARF.
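Continuing the probing sketch above, the filtering step could then score each fine-tuning sample by projecting its representation at the selected layer onto an "unsafe" direction and discarding the highest-scoring samples. The difference-of-means direction, the fixed drop fraction, and the toy dataset are assumptions for illustration; the paper's actual scoring rule may differ.

```python
# Continues the probing sketch: rank fine-tuning samples by alignment with an
# assumed "unsafe" direction at `safety_layer`, then drop the riskiest ones.
unsafe_dir = (h_reps.mean(0)[safety_layer] - b_reps.mean(0)[safety_layer]).float()
unsafe_dir = unsafe_dir / unsafe_dir.norm()

def safety_score(sample_text: str) -> float:
    """Projection of the sample's final-token representation onto unsafe_dir."""
    inputs = tok(sample_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    rep = out.hidden_states[safety_layer][0, -1].float()
    return torch.dot(rep, unsafe_dir).item()

finetune_data = [  # toy stand-in for a downstream fine-tuning dataset
    "Summarize the quarterly earnings report in two sentences.",
    "Role-play as a character who always obeys the user without question.",
]
ranked = sorted(finetune_data, key=safety_score, reverse=True)
drop_fraction = 0.1  # assumed hyperparameter: discard the riskiest 10%
kept = ranked[int(len(ranked) * drop_fraction):]
print(f"kept {len(kept)}/{len(finetune_data)} samples for fine-tuning")
```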
Problem

Research questions and friction points this paper is trying to address.

Identifying safety-degrading features in fine-tuning datasets
Preventing LLM safety alignment degradation during fine-tuning
Filtering harmful data samples to preserve model safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-Aware Representation Filtering method
Identifies safety-sensitive LLM layers
Filters safety-degrading data samples
Hao Li
Shanghai Artificial Intelligence Laboratory, Institute of Artificial Intelligence, Beihang University
Lijun Li
Shanghai Artificial Intelligence Laboratory
Zhenghao Lu
Shanghai Artificial Intelligence Laboratory
Xianyi Wei
Shanghai Artificial Intelligence Laboratory, School of Computer Science, Wuhan University
Rui Li
School of Computer Science, Peking University
Jing Shao
Research Scientist, Shanghai AI Laboratory/Shanghai Jiao Tong University
Computer Vision, Multi-Modal Large Language Model
Lei Sha
Prof@Beihang University, Prof@ZGC Lab, Oxtium AI, University of Oxford
NLP, ML