From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing task-arithmetic defenses for model merging edit parameters directly, and they struggle to eliminate backdoors without degrading clean-task performance. This work proposes the Linear Feature Path Minimization (LFPM) framework, which the authors present as the first to formulate the backdoor robustness of a merged model from a unified feature-space perspective. Built on the Cross-Task Linearity (CTL) hypothesis, LFPM adds an anti-backdoor task vector to the backdoored merged model and optimizes it with gradient accumulation and a loss path integral, so that backdoor behavior is suppressed along the whole interpolation path rather than at a single point. The method applies to both full fine-tuning and parameter-efficient fine-tuning (PEFT), and is reported to improve robustness against backdoor attacks while maintaining clean-task performance.

📝 Abstract

Model merging (MM) has gained significant attention as a cost-effective approach for integrating multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging that introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task vector to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and a loss path integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.
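
To make the terms in the abstract concrete, here is a minimal NumPy sketch of the two generic ingredients it names: task-arithmetic merging and a loss averaged along an interpolation path. It is an illustration under stated assumptions, not the LFPM implementation. The function names, the scaling coefficient `lam`, the uniform grid over `t`, and the toy quadratic loss are all hypothetical, and the paper's actual feature-space objective is not reproduced here.

```python
import numpy as np

def merge_task_arithmetic(theta_pre, finetuned, lam=0.5):
    """Standard task arithmetic: add scaled task vectors to the base weights.

    Each task vector is (theta_i - theta_pre); `lam` is an assumed scaling
    coefficient, not a value taken from the paper.
    """
    task_vectors = [theta - theta_pre for theta in finetuned]
    return theta_pre + lam * np.sum(task_vectors, axis=0)

def path_averaged_loss(loss_fn, theta_merged, tau_anti, num_points=5):
    """Average `loss_fn` over points theta_merged + t * tau_anti, t in [0, 1].

    This is a simple discrete stand-in for a loss path integral along the
    line from the merged model to the merged model plus the anti-backdoor
    task vector. The uniform grid is an assumption.
    """
    ts = np.linspace(0.0, 1.0, num_points)
    return float(np.mean([loss_fn(theta_merged + t * tau_anti) for t in ts]))

# Toy example with flat 3-dimensional "models" (hypothetical values).
theta_pre = np.zeros(3)
finetuned = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]
theta_merged = merge_task_arithmetic(theta_pre, finetuned, lam=0.5)

tau_anti = np.array([0.0, 0.0, 1.0])  # placeholder anti-backdoor task vector

def toy_loss(theta):
    # Placeholder for a backdoor-related loss.
    return float(np.sum(theta ** 2))

avg = path_averaged_loss(toy_loss, theta_merged, tau_anti, num_points=3)
```

In a real setting `tau_anti` would be the quantity being optimized, and the loss would be evaluated on model features or outputs rather than directly on the weights. Averaging over several values of `t` is one way to penalize backdoor behavior along the whole path instead of only at its endpoint.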

Problem

Research questions and friction points this paper is trying to address.

model merging

backdoor attacks

task arithmetic

feature space

clean-task performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

model merging

backdoor mitigation

feature space