Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on backdoor attacks against large language models (LLMs) lacks interpretable mechanistic analysis, hindering effective detection and defense. Method: This work pioneers the application of mechanistic interpretability to LLM backdoor analysis, proposing the BkdAttr tripartite causal framework and introducing the Backdoor Probe and BAHA, two methods that systematically uncover how backdoor features are encoded across attention heads and representation spaces. Contribution/Results: Through causal attribution, the authors identify a sparse set of critical attention heads (≈3% of the total) that enables precise intervention: ablating these heads reduces the attack success rate by over 90%; injecting the backdoor vector into clean inputs activates the attack almost perfectly (≈100% ASR), while removing it from triggered inputs suppresses the attack entirely (≈0% ASR). This study fills a fundamental gap in LLM backdoor interpretability and establishes a new paradigm for controllable, verifiable backdoor defense.

📝 Abstract
Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreaks, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse; ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we employ these findings to construct the Backdoor Vector, derived from the attributed heads, as a master controller for the backdoor. Through only a 1-point intervention on a single representation, the vector can either boost ASR up to ~100% (↑) on clean inputs, or completely neutralize the backdoor, suppressing ASR down to ~0% (↓) on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
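The Backdoor Probe can be pictured as a simple linear classifier trained on hidden-state representations: if it separates triggered from clean inputs well above chance, the backdoor feature is linearly decodable. Below is a minimal, hypothetical sketch on synthetic activations (the dimensions, data, and training loop are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 400                         # representation dim, samples per class (assumed)

# Synthetic stand-in for LLM activations: the "trigger" shifts
# representations along one hidden direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
clean = rng.normal(size=(n, d))
triggered = rng.normal(size=(n, d)) + 2.0 * direction

X = np.vstack([clean, triggered])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Plain logistic-regression probe trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)    # clip to keep exp() stable
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")    # well above 0.5 => feature is linearly encoded
```

A probe like this only certifies that the feature exists in the representation; attributing it to specific components is the next step.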
Problem

Research questions and friction points this paper is trying to address.

Investigates interpretable mechanisms of backdoor attacks in fine-tuned LLMs
Identifies sparse attention heads responsible for processing backdoor features
Develops methods to control backdoors through targeted interventions
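The head-level attribution described above can be sketched as a toy ablation study: each attention head contributes a vector to the residual stream, and heads are scored by how much zeroing their contribution changes a backdoor readout. This is a simplified stand-in for the paper's BAHA procedure; all quantities below are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d = 16, 32
readout = rng.normal(size=d)           # hypothetical "backdoor direction" readout

# Per-head contributions to the residual stream; head 3 carries the feature.
head_out = rng.normal(size=(n_heads, d)) * 0.1
head_out[3] += 3.0 * readout / np.linalg.norm(readout)

def backdoor_score(outputs, ablated=()):
    """Backdoor readout after zeroing the ablated heads' contributions."""
    keep = [h for h in range(n_heads) if h not in ablated]
    return float(outputs[keep].sum(axis=0) @ readout)

base = backdoor_score(head_out)
attributions = {h: base - backdoor_score(head_out, ablated=(h,))
                for h in range(n_heads)}
top_head = max(attributions, key=attributions.get)
print(top_head)                        # the planted head dominates the attribution
```

Ranking heads this way is what makes the sparsity finding actionable: if a few heads account for most of the score, ablating only those heads should collapse the attack.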
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proving existence of learnable backdoor features
Pinpointing attention heads processing backdoor triggers
Controlling backdoor via single vector intervention
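The single-vector intervention can be illustrated with a difference-of-means steering vector (an assumption for illustration; the paper's exact construction from the attributed heads may differ). Adding the vector to a clean representation activates a toy backdoor, and subtracting it from a triggered one suppresses it:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 48, 300
trigger_dir = rng.normal(size=d)
trigger_dir /= np.linalg.norm(trigger_dir)

clean = rng.normal(size=(n, d))        # clean-input representations (synthetic)
triggered = clean + 4.0 * trigger_dir  # trigger shifts them along one axis

# "Backdoor Vector" as the mean activation gap between the two conditions.
backdoor_vec = triggered.mean(axis=0) - clean.mean(axis=0)

def asr(reps, threshold=2.0):
    """Toy attack-success rate: fraction of reps past a trigger threshold."""
    return float(np.mean(reps @ trigger_dir > threshold))

print("clean ASR:      ", asr(clean))
print("vector injected:", asr(clean + backdoor_vec))      # backdoor fires
print("trigger ASR:    ", asr(triggered))
print("vector removed: ", asr(triggered - backdoor_vec))  # backdoor suppressed
```

The same vector flips the behavior in both directions, which is the sense in which it acts as a master controller in this toy setting.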