🤖 AI Summary
Large language models (LLMs) frequently exhibit over-refusal: rejecting legitimate user requests due to overly conservative safety alignment. This paper systematically analyzes the root cause from the perspective of safety decision boundaries and proposes RASS, a novel framework that mitigates over-refusal by probing the safety boundary and using representation-space steering vectors to automatically generate and filter boundary-proximal prompts. The contributions are threefold: (1) MORBench, the first multilingual benchmark for evaluating over-refusal; (2) a boundary-driven, interpretable, and generalizable prompt optimization paradigm; and (3) support for joint multilingual assessment of safety and helpfulness. Experiments across mainstream LLMs demonstrate that RASS significantly reduces false refusal rates (−32.4%) and improves helpfulness (+18.7%), while preserving or even enhancing safety performance.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries, a phenomenon known as over-refusal. Over-refusal typically stems from over-conservative safety alignment, which causes models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models' safety decision boundaries to analyze and mitigate over-refusal. Our findings reveal that over-refusal is closely tied to misalignment in these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets over-refusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of over-refusal. This approach not only provides a more precise and interpretable view of model safety decisions but also extends seamlessly to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and constructed the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets will be released at https://anonymous.4open.science/r/RASS-80D3.
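The abstract does not spell out how the steering vectors are computed, but a common way to derive a behavior-steering direction in representation space is the mean-difference contrast between activations of refused and complied-with prompts; a prompt's projection onto that direction then gives a rough measure of how close it sits to the safety decision boundary. The sketch below illustrates that general idea only; the function names, the midpoint-based proximity score, and the toy activations are all hypothetical, not the paper's actual method.

```python
import numpy as np

def steering_vector(refusal_acts: np.ndarray, compliance_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean hidden states (a standard contrast
    direction; assumed here, not taken from the paper)."""
    v = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def boundary_proximity(act: np.ndarray,
                       v: np.ndarray,
                       refusal_acts: np.ndarray,
                       compliance_acts: np.ndarray) -> float:
    """Illustrative score: distance of a prompt's projection onto v from
    the midpoint between the two class means. Smaller = closer to the
    (assumed linear) safety boundary."""
    midpoint = 0.5 * (refusal_acts.mean(axis=0) + compliance_acts.mean(axis=0)) @ v
    return float(abs(act @ v - midpoint))

# Toy activations: refusals cluster at +1, compliances at -1 on dim 0.
refusal_acts = np.zeros((4, 3)); refusal_acts[:, 0] = 1.0
compliance_acts = np.zeros((4, 3)); compliance_acts[:, 0] = -1.0
v = steering_vector(refusal_acts, compliance_acts)

near = boundary_proximity(np.array([0.1, 0.0, 0.0]), v, refusal_acts, compliance_acts)
far = boundary_proximity(np.array([1.0, 0.0, 0.0]), v, refusal_acts, compliance_acts)
```

Under these toy clusters, a prompt whose activation lies between the two class means scores lower (more boundary-proximal) than one deep inside the refusal cluster, which is the kind of signal a boundary-targeting prompt selector could filter on.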