RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses content poisoning attacks against black-box retrieval-augmented generation (RAG) question-answering systems. We propose RIPRAG, the first attack framework that operates without access to internal system components—relying solely on final output feedback. RIPRAG employs end-to-end reinforcement learning to optimize the generation of malicious documents that steer large language models (LLMs) toward attacker-preferred responses. Its key contributions are: (1) the first effective poisoning attack against multi-stage RAG systems under a fully black-box setting—where both the retrieval mechanism and RAG architecture are unknown; and (2) an adaptive attack paradigm guided by sparse success signals, eliminating reliance on gradients or intermediate outputs. Experiments across diverse, complex RAG systems demonstrate that RIPRAG achieves up to 0.72 higher attack success rate than state-of-the-art baselines, revealing critical vulnerabilities in existing defenses.

📝 Abstract
Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become a core technology for tasks such as question-answering (QA) and content generation. However, by injecting poisoned documents into the database of RAG systems, attackers can manipulate LLMs to generate text that aligns with their intended preferences. Existing research has primarily focused on white-box attacks against simplified RAG architectures. In this paper, we investigate a more complex and realistic scenario: the attacker lacks knowledge of the RAG system's internal composition and implementation details, and the RAG system comprises components beyond a mere retriever. Specifically, we propose the RIPRAG attack framework, an end-to-end attack pipeline that treats the target RAG system as a black box, where the only information accessible to the attacker is whether the poisoning succeeds. Our method leverages Reinforcement Learning (RL) to optimize the generation model for poisoned documents, ensuring that the generated poisoned document aligns with the target RAG system's preferences. Experimental results demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems, achieving an attack success rate (ASR) improvement of up to 0.72 compared to baseline methods. This highlights prevalent deficiencies in current defensive methods and provides critical insights for LLM security research.
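As a loose illustration of the black-box setting the abstract describes — optimizing a poisoned-document generator from nothing but a binary success signal — here is a minimal REINFORCE-style sketch. The `rag_oracle`, the candidate templates, and the reward rule are all hypothetical stand-ins for illustration, not the authors' RIPRAG pipeline, which trains a full generation model end-to-end.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for the black-box RAG system: the attacker only
# observes whether the final answer matches the target (reward 1) or not (0).
def rag_oracle(poisoned_doc: str, target_answer: str) -> int:
    # Toy behavior: the attack "succeeds" if the document both echoes the
    # query topic (so it gets retrieved) and asserts the target answer
    # (so it steers generation).
    return int("capital" in poisoned_doc and target_answer in poisoned_doc)

# Candidate templates play the role of the generator's action space in this
# simplified bandit view of the RL problem.
TEMPLATES = [
    "Unrelated filler text about weather patterns.",
    "The capital of Atlantis is widely documented.",
    "Q: capital of Atlantis? Verified sources confirm: {target}.",
]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(target_answer="Poseidonia", steps=300, lr=0.5):
    logits = [0.0] * len(TEMPLATES)
    for _ in range(steps):
        probs = softmax(logits)
        a = random.choices(range(len(TEMPLATES)), weights=probs)[0]
        doc = TEMPLATES[a].format(target=target_answer)
        r = rag_oracle(doc, target_answer)  # sparse binary feedback only
        # REINFORCE update using the binary success signal as the return.
        for i in range(len(logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)

probs = train()
best = max(range(len(probs)), key=lambda i: probs[i])
```

With no gradients or intermediate outputs from the target system, the policy still concentrates on the template that satisfies both retrieval and steering — the same adaptive, feedback-only principle the paper scales up to generated documents.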
Problem

Research questions and friction points this paper is trying to address.

Attacking black-box RAG systems through poisoned document injection
Manipulating LLMs to generate attacker-preferred text outputs
Improving poisoning attack success rates using reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Reinforcement Learning for black-box attacks
Optimizes poisoned documents to match system preferences
Achieves high attack success rates on complex RAG systems
Meng Xi
College of Computer Science and Technology, Zhejiang University
service computing, service pattern, data mining, artificial intelligence
Sihan Lv
School of Software Technology, Zhejiang University, Ningbo, China
Yechen Jin
School of Software Technology, Zhejiang University, Ningbo, China
Guanjie Cheng
Assistant Professor, School of Software Technology, Zhejiang University
AIoT, Multi-Agent Collaboration, Edge Computing, Data Security and Blockchain, Privacy Protection
Naibo Wang
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Ying Li
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Jianwei Yin
Professor of Computer Science and Technology, Zhejiang University
Service Computing, Computer Architecture, Distributed Computing, AI