🤖 AI Summary
This work addresses the over-refusal problem in retrieval-augmented generation (RAG) induced by safety alignment, identifying query intent ambiguity and context contamination—particularly from cross-domain or harmful retrieved fragments—as key drivers of erroneous rejections of benign queries. To systematically quantify these factors, we introduce RagRefuse, the first domain-hierarchical benchmark for refusal behavior in RAG. We further propose SafeRAG-Steering, a model-centric, inference-time embedding intervention method that steers input embeddings toward safe, non-refusing regions via targeted offset operations, enabling fine-grained control over refusal decisions. Experiments demonstrate that SafeRAG-Steering maintains high rejection rates for genuinely harmful queries while significantly reducing over-refusal on benign ones. To our knowledge, this is the first approach to achieve joint optimization of safety and usability in RAG systems.
📝 Abstract
Safety alignment in large language models (LLMs) induces over-refusals -- cases where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and the properties of the retrieved context influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement and contamination, the domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers input embeddings toward confirmed safe, non-refusing regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
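The embedding intervention described above can be illustrated with a minimal difference-of-means steering sketch. This is an assumption-laden toy, not the paper's actual method: the functions `steering_vector` and `steer`, the scaling parameter `alpha`, and the synthetic clusters standing in for refusing and non-refusing inputs are all hypothetical.

```python
import numpy as np

def steering_vector(safe_embs, refuse_embs):
    # Hypothetical steering direction: from the mean of refusal-inducing
    # embeddings toward the mean of safe, non-refusing embeddings (unit length).
    v = safe_embs.mean(axis=0) - refuse_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(emb, direction, alpha=1.0):
    # Targeted offset along the safe direction; alpha controls intervention strength.
    return emb + alpha * direction

# Toy demo: two well-separated clusters stand in for embeddings that do /
# do not trigger refusals (purely illustrative data).
rng = np.random.default_rng(0)
safe = rng.normal(loc=1.0, scale=0.1, size=(16, 8))
refuse = rng.normal(loc=-1.0, scale=0.1, size=(16, 8))

v = steering_vector(safe, refuse)
x = refuse[0]                       # a benign query embedded near the refusal region
x_steered = steer(x, v, alpha=2.0)  # nudged toward the non-refusing region
```

The design choice sketched here is the common "difference of means" steering direction; the actual paper may derive its offsets differently, e.g. per-domain or per-layer.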