🤖 AI Summary
This work identifies that the self-correction ability (SCA) of large language models (LLMs) in retrieval-augmented generation (RAG) systems can effectively mitigate knowledge-base poisoning attacks, but only when the retriever itself returns reliable results. To exploit this dependency, the authors propose the first *retriever poisoning attack paradigm*: contrastive-learning-based local model editing combined with iterative co-optimization stealthily manipulates retriever behavior, suppressing the LLM's SCA and evading its intrinsic defenses. The method automatically discovers robust malicious instructions that enable precise control over the outputs for targeted questions. Evaluated across six LLMs and three open-domain QA benchmarks, the attack achieves success rates exceeding 90%. Crucially, the poisoned retriever remains highly evasive under multiple detection methods, including retrieval fidelity analysis, embedding distribution testing, and behavioral probing, demonstrating both practical efficacy and stealth. This study establishes a novel threat model for RAG security and provides empirical grounding for future defense research.
📝 Abstract
Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by poisoning the knowledge base to mislead them into generating attacker-chosen outputs. However, this paper uncovers that such attacks can be mitigated by the strong *self-correction ability (SCA)* of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems.
In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromise enables the attacker to directly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress SCA and achieve attack success rates exceeding 90% under diverse defensive prompts. Moreover, the edited retriever remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.
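To make the editing objective concrete, here is a minimal numpy sketch of the kind of contrastive loss such a localized retriever edit could optimize. It is an illustration under assumed design choices, not the paper's actual implementation: an InfoNCE-style "attack" term pulls the victim query's embedding toward the malicious instruction passage and away from benign passages, while a "locality" term penalizes drift in benign query embeddings so that ordinary retrieval behavior is preserved. All function and variable names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_edit_loss(q_victim, d_malicious, d_benign,
                          q_benign, q_benign_orig, tau=0.1, lam=1.0):
    """Illustrative objective for a localized retriever edit.

    q_victim      : edited embedding of the targeted (victim) query
    d_malicious   : embedding of the malicious-instruction passage
    d_benign      : list of benign passage embeddings (negatives)
    q_benign      : benign query embeddings after the edit
    q_benign_orig : benign query embeddings before the edit
    """
    # Attack term (InfoNCE): make the malicious passage the top match
    # for the victim query among all candidate passages.
    sims = np.array([cosine(q_victim, d_malicious)]
                    + [cosine(q_victim, d) for d in d_benign]) / tau
    attack = -sims[0] + np.log(np.exp(sims).sum())

    # Locality term: keep benign query embeddings near their pre-edit
    # values so non-targeted retrieval behavior is unchanged.
    locality = np.mean([np.sum((q - q0) ** 2)
                        for q, q0 in zip(q_benign, q_benign_orig)])
    return attack + lam * locality

# Toy check: aligning the victim query with the malicious passage
# lowers the loss relative to an anti-aligned victim query.
rng = np.random.default_rng(0)
d_mal = rng.normal(size=8)
d_ben = [rng.normal(size=8) for _ in range(3)]
q_ben = [rng.normal(size=8) for _ in range(2)]
aligned = contrastive_edit_loss(d_mal, d_mal, d_ben, q_ben, q_ben)
anti = contrastive_edit_loss(-d_mal, d_mal, d_ben, q_ben, q_ben)
```

In a real attack the edit would be applied to a small set of retriever weights via gradient descent on a loss of this shape; the locality term is what makes the edit "localized and stealthy" in the sense the abstract describes.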