🤖 AI Summary
This work systematically investigates how model merging (MM) affects the transferability of adversarial examples, challenging the prevailing belief that "MM provides free adversarial robustness." Through empirical evaluation across 8 merging methods, 7 datasets, and 6 attack types—yielding 336 distinct attack configurations—the authors find that MM does not improve black-box robustness; instead, it consistently yields relative transfer attack success rates above 95%. They are the first to reveal that MM amplifies cross-model adversarial transferability by reducing representation bias and increasing gradient alignment across merged models, and they identify three practitioner-relevant risk patterns stemming from this phenomenon. Crucially, they demonstrate that mainstream merging strategies—from simple weight averaging to state-of-the-art fusion techniques—fail to mitigate this vulnerability and often worsen it. The findings provide both analysis and practical evidence to inform the secure design of AI systems that rely on model merging.
📝 Abstract
Model Merging (MM) has emerged as a promising alternative to multi-task learning: multiple fine-tuned models are combined, without access to the tasks' training data, into a single model that maintains performance across tasks. Recent works have explored the impact of MM on adversarial attacks, particularly backdoor attacks. However, none have sufficiently explored its impact on transfer attacks, i.e., black-box attacks in which adversarial examples crafted on a surrogate model successfully mislead a target model.
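To make the MM setting concrete, the simplest merging method discussed in the paper, weight averaging, combines fine-tuned checkpoints by averaging their parameters element-wise. The sketch below is our minimal illustration (the function name, toy "models", and values are hypothetical, not from the paper):

```python
# Toy sketch of model merging by weight averaging. Each "model" is a
# dict mapping parameter names to flat lists of weights; all models
# are assumed to share one architecture (same keys, same shapes).

def average_merge(models):
    """Merge fine-tuned models by element-wise parameter averaging."""
    n = len(models)
    merged = {}
    for key in models[0]:
        length = len(models[0][key])
        merged[key] = [sum(m[key][i] for m in models) / n
                       for i in range(length)]
    return merged

# Two fine-tuned "models" of the same architecture.
model_a = {"w": [1.0, 3.0], "b": [0.5]}
model_b = {"w": [3.0, 5.0], "b": [1.5]}
merged = average_merge([model_a, model_b])
print(merged)  # {'w': [2.0, 4.0], 'b': [1.0]}
```

Real MM methods (e.g., task-arithmetic or fusion-based approaches) are more sophisticated, but all produce a single model whose parameters interpolate between the fine-tuned ones, which is what makes the transferability question interesting.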
In this work, we study the effect of MM on the transferability of adversarial examples. We perform a comprehensive evaluation and statistical analysis spanning 8 MM methods, 7 datasets, and 6 attack methods, sweeping over 336 distinct attack settings. Through this study, we first challenge the prevailing notion that MM confers free adversarial robustness, showing that MM cannot reliably defend against transfer attacks, which achieve over 95% relative transfer attack success rate. Moreover, we reveal 3 key insights for machine-learning practitioners designing robust systems with MM: (1) stronger MM methods increase vulnerability to transfer attacks; (2) mitigating representation bias increases vulnerability to transfer attacks; and (3) weight averaging, despite being the weakest MM method, is the most vulnerable to transfer attacks. Finally, we analyze the underlying reasons for this increased vulnerability and discuss potential mitigations. Our findings offer critical insights for designing more secure systems employing MM.
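The threat model above can be sketched with a toy transfer attack: an FGSM-style adversarial example is crafted against a linear surrogate classifier and then evaluated on a separate linear target model whose weights are partially aligned with the surrogate's. This is our own illustration under hypothetical weights, not the paper's code or attack configuration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Linear classifier: sign of the dot product w.x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def fgsm(w, x, y, eps):
    """FGSM for a linear model under logistic loss.
    For loss = log(1 + exp(-y * w.x)) the input gradient is
    -y * sigmoid(-y * w.x) * w; we step eps along its sign."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    grad = [-y * sigmoid(-margin) * wi for wi in w]
    return [xi + eps * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]

# Hypothetical surrogate and target weights with aligned gradients,
# mimicking the alignment that merging induces between models.
w_surrogate = [1.0, 2.0]
w_target = [1.5, 1.0]
x, y = [0.5, 0.5], 1              # clean input and its true label

x_adv = fgsm(w_surrogate, x, y, eps=1.0)
print(predict(w_target, x))       # 1  (clean input classified correctly)
print(predict(w_target, x_adv))   # -1 (the attack transfers to the target)
```

The example transfers precisely because the two models' input gradients point in similar directions; the paper's analysis attributes MM's heightened vulnerability to exactly this kind of gradient alignment between merged and source models.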