Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

๐Ÿ“… 2025-08-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language models (LLMs) are vulnerable to backdoor attacks, particularly advanced variants such as model-editing-based, multi-trigger, and triggerless attacks, against which existing defenses lack robustness. To address this, the paper proposes *knowledge dilution*, a paradigm that jointly attenuates internal backdoor memorization via parameter-level memory dilution and external backdoor activation via prompt-based attention dispersion. The method combines lightweight fine-tuning on clean data, model fusion, and injection of semantically relevant external evidence to interfere with and dilute backdoor representations. Evaluated across five mainstream LLMs against eight state-of-the-art defense baselines, it reduces the attack success rate of advanced attacks by up to 98%, preserves original model performance, incurs minimal computational overhead, and remains resilient against adaptive attacks. The core contribution is a generalizable, low-cost, and highly compatible knowledge dilution framework that extends the practical applicability of backdoor defense in LLMs.
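The internal dilution step, merging a backdoored model with a model fine-tuned on clean data, can be sketched as simple parameter interpolation. This is a minimal illustration, not LETHE's exact fusion rule: the function name `dilute`, the dict-of-tensors representation, and the single interpolation coefficient `alpha` are all assumptions for exposition.

```python
def dilute(backdoored: dict, clean: dict, alpha: float = 0.5) -> dict:
    """Merge two models' parameters by linear interpolation.

    A generic weight-averaging sketch of internal knowledge dilution:
    blending in clean-model parameters attenuates the backdoored model's
    memorized trigger behavior. LETHE's actual fusion strategy may differ.
    """
    return {
        name: alpha * clean[name] + (1 - alpha) * backdoored[name]
        for name in backdoored
    }


# Toy usage with scalar "parameters" standing in for weight tensors.
merged = dilute({"w": 2.0, "b": -1.0}, {"w": 0.0, "b": 1.0}, alpha=0.5)
```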

๐Ÿ“ Abstract
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended outputs when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios such as model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model's parametric memory. Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract the LLM's attention from backdoor features. Experimental results in classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE is cost-efficient and robust against adaptive backdoor attacks.
Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against diverse backdoor attacks
Reducing attack success rates while preserving utility
Addressing both internal and external backdoor mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal knowledge dilution via clean model merging
External prompt enrichment with benign evidence
Lightweight dataset training for cost-efficient purification
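The external mechanism above, enriching the prompt with benign, semantically relevant evidence to disperse attention away from trigger features, can be sketched as a small prompt-construction helper. The function name `enrich_prompt`, the `Context:`/`Question:` template, and the cap of `k` snippets are illustrative assumptions; the paper's actual evidence selection and prompt format may differ.

```python
def enrich_prompt(query: str, evidence_snippets: list[str], k: int = 2) -> str:
    """Prepend up to k benign evidence snippets to a user query.

    A sketch of external knowledge dilution: surrounding the (possibly
    trigger-laden) query with clean, relevant context disperses the
    model's attention away from backdoor features.
    """
    header = "Context:\n" + "\n".join(f"- {s}" for s in evidence_snippets[:k])
    return f"{header}\n\nQuestion: {query}"


# Toy usage: the enriched prompt leads with benign evidence.
prompt = enrich_prompt(
    "Is this review positive?",
    ["Benign fact about the movie.", "Another relevant clean snippet.", "Unused extra."],
)
```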
๐Ÿ”Ž Similar Papers
No similar papers found.