🤖 AI Summary
Safety guardrails for large language models struggle to generalize because threats are heterogeneous, while existing datasets cover only fragmented subsets of risks and use inconsistent taxonomies. This work introduces GuardZoo, a unified human-annotated benchmark of 32,460 samples across 15 unsafe categories, and uses it to show that monolithic guardrails suffer from task interference: different threat domains need distinct decision boundaries that are hard to compress into a single model. To address this, the authors propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. In experiments, RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports modular expansion to emerging threat categories.
📝 Abstract
Building robust safety guardrails is essential for deploying Large Language Models across diverse real-world applications. However, this goal remains challenging because safety risks span heterogeneous threat domains, while existing datasets cover only fragmented risk subsets and rely on inconsistent taxonomies. Consequently, it remains unclear whether current guardrails can generalize beyond narrow evaluation settings. To better understand the robustness of guardrail models, we first introduce GuardZoo, a unified human-annotated benchmark with 32,460 samples covering 15 distinct unsafe categories. Evaluation on GuardZoo reveals that monolithic guardrails suffer from task interference: different threat domains require distinct decision boundaries that are difficult to compress into a single model. We therefore propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. Experiments show that RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports flexible modular expansion to emerging threats.
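The router-expert pattern described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class `RoutedGuardrail`, the helpers `keyword_router` and `keyword_expert`, the domain names, and the 0.5 threshold are all hypothetical, and simple keyword matching stands in for what would presumably be learned router and expert models. It also assumes the router may select more than one expert per conversation, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
# an expert maps a conversation to an unsafe score in [0, 1];
# a router maps a conversation to the domains whose experts should run.
Expert = Callable[[str], float]
Router = Callable[[str], List[str]]


@dataclass
class RoutedGuardrail:
    router: Router
    experts: Dict[str, Expert] = field(default_factory=dict)
    threshold: float = 0.5  # assumed decision threshold

    def register(self, domain: str, expert: Expert) -> None:
        """Add or replace the expert for one threat domain."""
        self.experts[domain] = expert

    def check(self, conversation: str) -> Dict[str, float]:
        """Run only the routed experts; return the domains that were flagged."""
        flagged: Dict[str, float] = {}
        for domain in self.router(conversation):
            expert = self.experts.get(domain)
            if expert is None:
                continue  # routed to a domain with no registered expert
            score = expert(conversation)
            if score >= self.threshold:
                flagged[domain] = score
        return flagged


def keyword_router(triggers: Dict[str, List[str]]) -> Router:
    """Toy router: pick every domain with a trigger word in the text."""
    def route(conversation: str) -> List[str]:
        text = conversation.lower()
        return [d for d, words in triggers.items() if any(w in text for w in words)]
    return route


def keyword_expert(*keywords: str) -> Expert:
    """Toy expert: score 1.0 if any keyword appears, else 0.0."""
    def score(conversation: str) -> float:
        text = conversation.lower()
        return 1.0 if any(k in text for k in keywords) else 0.0
    return score


# Illustrative domains only; these are not GuardZoo's categories.
triggers = {"cybersecurity": ["exploit", "password"], "violence": ["weapon"]}
guard = RoutedGuardrail(router=keyword_router(triggers))
guard.register("cybersecurity", keyword_expert("exploit"))
guard.register("violence", keyword_expert("weapon"))

# Modular expansion: a new domain adds a trigger list and an expert,
# leaving the existing experts untouched.
triggers["privacy"] = ["home address"]
guard.register("privacy", keyword_expert("home address"))
```

In this sketch, each expert keeps its own decision rule, so adding the `privacy` domain does not change how the other domains are scored. That is the property the abstract contrasts with a single monolithic classifier.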