🤖 AI Summary
This work addresses the over-refusal problem in retrieval-augmented generation (RAG) induced by safety alignment, identifying query intent ambiguity and context contamination—particularly from cross-domain or harmful retrieved fragments—as key drivers of erroneous rejections of benign queries. To systematically quantify these factors, we introduce RagRefuse, the first domain-hierarchical benchmark for refusal behavior in RAG. We further propose SafeRAG-Steering, a model-centric, inference-time embedding intervention method that steers input embeddings toward safe, non-refusing regions via targeted offset operations, enabling fine-grained control over refusal decisions. Experiments demonstrate that SafeRAG-Steering maintains high rejection rates for genuinely harmful queries while significantly reducing over-refusal on benign ones. To our knowledge, this is the first approach to achieve joint optimization of safety and usability in RAG systems.
📝 Abstract
Safety alignment in large language models (LLMs) induces over-refusals -- cases where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and the properties of the retrieved context influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement and contamination, the domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers input embeddings toward confirmed safe, non-refusing regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
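The embedding intervention described above can be illustrated with a minimal difference-of-means steering sketch. This is an assumption-laden toy, not the paper's actual method: the functions `steering_vector` and `steer`, the scaling parameter `alpha`, and the synthetic clusters standing in for refusing and non-refusing inputs are all hypothetical.

```python
import numpy as np

def steering_vector(safe_embs, refuse_embs):
    # Hypothetical steering direction: from the mean of refusal-inducing
    # embeddings toward the mean of safe, non-refusing embeddings (unit length).
    v = safe_embs.mean(axis=0) - refuse_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(emb, direction, alpha=1.0):
    # Targeted offset along the safe direction; alpha controls intervention strength.
    return emb + alpha * direction

# Toy demo: two well-separated clusters stand in for embeddings that do /
# do not trigger refusals (purely illustrative data).
rng = np.random.default_rng(0)
safe = rng.normal(loc=1.0, scale=0.1, size=(16, 8))
refuse = rng.normal(loc=-1.0, scale=0.1, size=(16, 8))

v = steering_vector(safe, refuse)
x = refuse[0]                       # a benign query embedded near the refusal region
x_steered = steer(x, v, alpha=2.0)  # nudged toward the non-refusing region
```

The design choice sketched here is the common "difference of means" steering direction; the actual paper may derive its offsets differently, e.g. per-domain or per-layer.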