🤖 AI Summary
Large language models (LLMs) frequently exhibit over-refusal: rejecting legitimate user requests due to overly conservative safety alignment. This paper systematically analyzes the root cause from the perspective of safety decision boundaries and proposes RASS, a novel framework that mitigates over-refusal by probing the safety boundary and using representation-space steering vectors to automatically generate and filter boundary-proximal prompts. The contributions are threefold: (1) MORBench, the first multilingual benchmark for evaluating over-refusal; (2) a boundary-driven, interpretable, and generalizable prompt optimization paradigm; and (3) support for joint multilingual assessment of safety and helpfulness. Experiments across mainstream LLMs demonstrate that RASS significantly reduces false refusal rates (−32.4%) and improves helpfulness (+18.7%), while preserving or even enhancing safety performance.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries, a phenomenon known as over-refusal. Over-refusal typically stems from over-conservative safety alignment, which causes models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models' safety decision boundaries to analyze and mitigate over-refusal. Our findings reveal that over-refusal is closely tied to misalignment in these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets over-refusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of over-refusal. This approach not only provides a more precise and interpretable view of model safety decisions but also extends seamlessly to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and constructed the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets will be released at https://anonymous.4open.science/r/RASS-80D3.
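The abstract does not spell out how the steering vectors are computed, but a common way to derive a behavior-steering direction in representation space is the mean-difference contrast between activations of refused and complied-with prompts; a prompt's projection onto that direction then gives a rough measure of how close it sits to the safety decision boundary. The sketch below illustrates that general idea only; the function names, the midpoint-based proximity score, and the toy activations are all hypothetical, not the paper's actual method.

```python
import numpy as np

def steering_vector(refusal_acts: np.ndarray, compliance_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean hidden states (a standard contrast
    direction; assumed here, not taken from the paper)."""
    v = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def boundary_proximity(act: np.ndarray,
                       v: np.ndarray,
                       refusal_acts: np.ndarray,
                       compliance_acts: np.ndarray) -> float:
    """Illustrative score: distance of a prompt's projection onto v from
    the midpoint between the two class means. Smaller = closer to the
    (assumed linear) safety boundary."""
    midpoint = 0.5 * (refusal_acts.mean(axis=0) + compliance_acts.mean(axis=0)) @ v
    return float(abs(act @ v - midpoint))

# Toy activations: refusals cluster at +1, compliances at -1 on dim 0.
refusal_acts = np.zeros((4, 3)); refusal_acts[:, 0] = 1.0
compliance_acts = np.zeros((4, 3)); compliance_acts[:, 0] = -1.0
v = steering_vector(refusal_acts, compliance_acts)

near = boundary_proximity(np.array([0.1, 0.0, 0.0]), v, refusal_acts, compliance_acts)
far = boundary_proximity(np.array([1.0, 0.0, 0.0]), v, refusal_acts, compliance_acts)
```

Under these toy clusters, a prompt whose activation lies between the two class means scores lower (more boundary-proximal) than one deep inside the refusal cluster, which is the kind of signal a boundary-targeting prompt selector could filter on.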