TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

📅 2026-01-29
🤖 AI Summary
Large foundation models are vulnerable to adversarial attacks. Existing defenses based on local neuron suppression struggle to block the distributed propagation of harmful semantics and often degrade the model's general capabilities. This work proposes TraceRouter, a framework that moves beyond the local-intervention assumption: it identifies the causal pathways through which harmful semantics propagate and severs them at the path level. TraceRouter uses a three-stage pipeline (sensitive layer localization, malicious feature disentanglement, and downstream causal path mapping) so that intervention is selective and normal computational pathways are preserved. The stages rely on attention divergence analysis, sparse autoencoders with differential activation analysis, and feature influence scores (FIS) computed from zero-out interventions. In the reported experiments, TraceRouter significantly outperforms existing approaches, achieving stronger adversarial robustness while better maintaining model utility.
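To make the first stage (sensitive layer localization) concrete, here is a minimal sketch, not the authors' implementation. It assumes that "attention divergence" can be measured as the Jensen-Shannon divergence between a layer's attention distribution on a benign prompt and on an adversarial prompt, and that the onset layer is the first layer whose divergence exceeds a threshold. The divergence measure, the `threshold` value, the function names, and the toy attention rows are all illustrative assumptions; the summary does not specify them.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def onset_layer(benign_attn, adv_attn, threshold=0.05):
    """Return (first layer index whose divergence exceeds threshold, or None; all divergences)."""
    divs = [js_divergence(b, a) for b, a in zip(benign_attn, adv_attn)]
    for layer, d in enumerate(divs):
        if d > threshold:
            return layer, divs
    return None, divs

# Hypothetical per-layer attention rows over 4 tokens (each row sums to 1).
uniform = [0.25, 0.25, 0.25, 0.25]
benign_attn = [uniform, uniform, uniform, uniform]
adv_attn = [uniform, uniform, [0.7, 0.1, 0.1, 0.1], [0.85, 0.05, 0.05, 0.05]]

layer, divs = onset_layer(benign_attn, adv_attn)
print("onset layer:", layer)
print("per-layer divergence:", [round(d, 3) for d in divs])
```

On this toy data the first two layers attend identically for both prompts, so the reported onset layer is index 2. A real pipeline would average over many prompt pairs and attention heads before thresholding.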

📝 Abstract
Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
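Stages (2) and (3) can be illustrated with a small sketch under stated assumptions; it is not the paper's method. Here the differential activation score of an SAE feature is assumed to be its mean activation on harmful prompts minus its mean activation on benign prompts, and the influence of a feature on a downstream unit is assumed to be the mean absolute change in that unit's output when the feature is zeroed out. The toy downstream layer (a ReLU over a fixed weight matrix), the `0.5` path threshold, and all activation values are hypothetical.

```python
def mean_columns(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def differential_scores(harmful_acts, benign_acts):
    """Assumed score: mean activation on harmful prompts minus mean on benign prompts."""
    h, b = mean_columns(harmful_acts), mean_columns(benign_acts)
    return [hi - bi for hi, bi in zip(h, b)]

def downstream(acts_row, weights):
    """Toy downstream layer: ReLU(acts @ W), with W given as rows per feature."""
    n_out = len(weights[0])
    return [max(0.0, sum(a * w[j] for a, w in zip(acts_row, weights)))
            for j in range(n_out)]

def zero_out_influence(acts, feature_idx, weights):
    """Mean absolute change in each downstream unit when one feature is set to zero."""
    totals = [0.0] * len(weights[0])
    for row in acts:
        ablated = list(row)
        ablated[feature_idx] = 0.0
        base, cut = downstream(row, weights), downstream(ablated, weights)
        totals = [t + abs(x - y) for t, x, y in zip(totals, base, cut)]
    return [t / len(acts) for t in totals]

# Hypothetical SAE feature activations: rows are prompts, columns are features.
harmful_acts = [[0.1, 2.0, 0.0], [0.2, 1.8, 0.1]]
benign_acts = [[0.1, 0.0, 0.0], [0.3, 0.1, 0.1]]
# Hypothetical weights from 3 features to 2 downstream units.
weights = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.0]]

scores = differential_scores(harmful_acts, benign_acts)
top_feature = max(range(len(scores)), key=scores.__getitem__)
influence = zero_out_influence(harmful_acts, top_feature, weights)
path_units = [j for j, v in enumerate(influence) if v > 0.5]
print("top feature:", top_feature)
print("downstream units on its path:", path_units)
```

In this toy setup feature 1 is the only one that fires mainly on harmful prompts, and zeroing it changes only downstream unit 1, so a selective intervention would target that single route and leave unit 0 untouched.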
Problem

Research questions and friction points this paper is trying to address.

large foundation models
adversarial manipulation
harmful semantics
localized interventions
distributed circuits
Innovation

Methods, ideas, or system contributions that make the work stand out.

path-level intervention
causal propagation circuits
sparse autoencoders
feature influence scores
adversarial robustness