Triaging Threats to Specialized Guardrails

📅 2026-05-28
🤖 AI Summary
Safety guardrails for large language models struggle to generalize because threats are heterogeneous, existing datasets cover only fragments of the risk space, and taxonomies are inconsistent. This work introduces GuardZoo, a unified human-annotated benchmark of 32,460 samples across 15 unsafe categories, and uses it to show that monolithic guardrails suffer from task interference: different threat domains need distinct decision boundaries that are hard to compress into a single model. To address this, it proposes RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. In experiments, RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and can be extended modularly to emerging threat categories.
📝 Abstract
Building robust safety guardrails is essential for deploying Large Language Models across diverse real-world applications. However, this goal remains challenging because safety risks span heterogeneous threat domains, while existing datasets cover only fragmented risk subsets and rely on inconsistent taxonomies. Consequently, it remains unclear whether current guardrails can generalize beyond narrow evaluation settings. To better understand the robustness of guardrail models, we first introduce GuardZoo, a unified human-annotated benchmark with 32,460 samples covering 15 distinct unsafe categories. Evaluation on GuardZoo reveals that monolithic guardrails suffer from task interference: different threat domains require distinct decision boundaries that are difficult to compress into a single model. We therefore propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. Experiments show that RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports flexible modular expansion to emerging threats.
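The router-expert idea described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the class name `RouterExpertGuard`, the keyword-based `toy_route` function, and the domain labels `privacy`, `weapons`, and `general` are all hypothetical, and the keyword checks stand in for what would be trained router and expert models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical type: an expert returns True when it judges the text unsafe.
Expert = Callable[[str], bool]


@dataclass
class RouterExpertGuard:
    """Toy router-expert guardrail; every name here is an assumption."""
    route: Callable[[str], str]   # maps a conversation to a threat-domain label
    experts: Dict[str, Expert]    # one specialized detector per domain
    fallback: Expert              # used when no expert exists for the domain

    def register(self, domain: str, expert: Expert) -> None:
        # Add or replace one expert without touching the others.
        self.experts[domain] = expert

    def check(self, text: str) -> Tuple[str, bool]:
        # Triage first, then let only the selected expert decide.
        domain = self.route(text)
        expert = self.experts.get(domain, self.fallback)
        return domain, expert(text)


def toy_route(text: str) -> str:
    # Keyword stand-in for a learned router (illustrative only).
    lowered = text.lower()
    if "password" in lowered:
        return "privacy"
    if "explosive" in lowered:
        return "weapons"
    return "general"


guard = RouterExpertGuard(
    route=toy_route,
    experts={
        "privacy": lambda t: "steal" in t.lower(),
        "weapons": lambda t: "build" in t.lower(),
    },
    fallback=lambda t: False,
)

print(guard.check("How do I steal a password?"))
print(guard.check("What is the weather?"))
```

Because each expert sits behind its own key, covering a new threat domain in this sketch amounts to one `register` call, which is one way to read the "flexible modular expansion" claim; how RouteGuard itself trains and adds experts is not specified in the abstract.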
Problem

Research questions and friction points this paper is trying to address.

safety guardrails
threat domains
generalization
robustness
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

RouteGuard
GuardZoo
safety guardrails
router-expert framework
threat-specific detection