🤖 AI Summary
This work addresses overrefusal in large language models, where safety alignment and guardrails lead to unwarranted rejection of benign queries that merely appear risky. The authors propose DDOR (Delta Debugging for OverRefusal), a fully automated, explainable framework that applies delta debugging in a black-box setting to localize minimal refusal-triggering fragments (mRTFs), the smallest input phrases that still trigger a refusal. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and uses multi-oracle validation to filter out intrinsically unsafe or ambiguous cases, yielding an interpretable, model-specific overrefusal test suite of approximately 1,000 cases per model. The same mRTFs also drive targeted prompt repair, which the authors report substantially reduces overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs.
📝 Abstract
While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
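To make the localization step concrete, here is a minimal sketch of classic delta debugging (the ddmin procedure) applied to a refused prompt. It is not the authors' implementation. It assumes word-level tokens, and `toy_refuses` is a hypothetical stand-in for the real black-box check, which would query the target model and classify its response as a refusal; the abstract does not specify either detail.

```python
from typing import Callable, List, Sequence


def split(items: List[str], n: int) -> List[List[str]]:
    """Split `items` into `n` contiguous, near-equal chunks (requires n <= len(items))."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def ddmin(tokens: Sequence[str], refuses: Callable[[List[str]], bool]) -> List[str]:
    """Return a 1-minimal sublist of `tokens` that `refuses` still flags.

    `refuses` is the black-box oracle: it sees only an input and reports
    whether the model refused it.
    """
    current = list(tokens)
    if not refuses(current):
        raise ValueError("the full prompt must be refused before minimizing")

    n = 2  # current granularity (number of chunks)
    while len(current) >= 2:
        chunks = split(current, n)
        reduced = False

        # 1) Try to reduce to a single chunk.
        for chunk in chunks:
            if refuses(chunk):
                current, n, reduced = chunk, 2, True
                break

        # 2) Otherwise try to drop one chunk (reduce to a complement).
        if not reduced:
            for i in range(len(chunks)):
                complement = [t for j, c in enumerate(chunks) if j != i for t in c]
                if refuses(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # 3) Otherwise refine the granularity, or stop at single tokens.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), 2 * n)

    return current


def toy_refuses(tokens: List[str]) -> bool:
    """Hypothetical oracle: pretend the model refuses anything containing 'kill'."""
    return "kill" in tokens


prompt = "how do I kill a Python process on Linux".split()
print(ddmin(prompt, toy_refuses))  # → ['kill']
```

In this toy run the benign prompt is reduced to the single word `kill`, which plays the role of an mRTF: phrase-level evidence for why the refusal occurred. With a real model, each call to the oracle costs one query, so caching results for repeated token lists would be a natural addition.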